

Trying to mimic scikit-learn's ngram with gensim


I'm trying to mimic the ngram_range parameter of CountVectorizer() with gensim. My goal is to be able to use LDA with either scikit-learn or gensim and to find very similar bigrams.
For example, with scikit-learn we can find bigrams such as "abc computer" and "binary unordered", while with gensim we only find "A survey", "Graph minors"...
I have attached my code below to compare gensim and scikit-learn in terms of bigrams/unigrams.
Thanks for your help
documents = [["Human" ,"machine" ,"interface" ,"for" ,"lab", "abc" ,"computer" ,"applications"],
["A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
["The", "EPS", "user", "interface", "management", "system"],
["System", "and", "human", "system", "engineering", "testing", "of", "EPS"],
["Relation", "of", "user", "perceived", "response", "time", "to", "error", "measurement"],
["The", "generation", "of", "random", "binary", "unordered", "trees"],
["The", "intersection", "graph", "of", "paths", "in", "trees"],
["Graph", "minors", "IV", "Widths", "of", "trees", "and", "well", "quasi", "ordering"],
["Graph", "minors", "A", "survey"]]
With the gensim model we find 48 unique tokens, and we can print the unigrams/bigrams with print(dictionary.token2id):
# 1. Gensim
from gensim import corpora
from gensim.models import Phrases

# Add bigrams to docs (min_count=1 keeps every bigram seen at least once).
bigram = Phrases(documents, min_count=1)
for idx in range(len(documents)):
    for token in bigram[documents[idx]]:
        if '_' in token:
            # Token is a bigram, add it to the document.
            documents[idx].append(token)
documents = [[token.replace("_", " ") for token in doc] for doc in documents]
print(documents)
dictionary = corpora.Dictionary(documents)
print(dictionary.token2id)
And with scikit-learn, 96 unique tokens; we can print scikit-learn's vocabulary with print(vocab):
# 2. Scikit
import re
from sklearn.feature_extraction.text import CountVectorizer

token_pattern = re.compile(r"\b\w\w+\b", re.U)

def custom_tokenizer(s, min_term_length=1):
    """
    Tokenizer that splits text on any whitespace, keeping only terms of at
    least a certain length which start with an alphabetic character.
    """
    return [x.lower() for x in token_pattern.findall(s)
            if len(x) >= min_term_length and x[0].isalpha()]

def preprocess(docs, min_df=1, min_term_length=1, ngram_range=(1, 1), tokenizer=custom_tokenizer):
    """
    Preprocess a list of text documents stored as strings (not tokenized).
    """
    # Build the vector space model of raw term counts.
    vec = CountVectorizer(lowercase=True,
                          strip_accents="unicode",
                          tokenizer=tokenizer,
                          min_df=min_df,
                          ngram_range=ngram_range,
                          stop_words=None)
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names()
    return (X, vocab)

docs_join = [' '.join(doc) for doc in documents]
(X, vocab) = preprocess(docs_join, ngram_range=(1, 2))
print(vocab)
gensim's Phrases class is designed to "Automatically detect common phrases (multiword expressions) from a stream of sentences."
So it only gives you the bigrams that "appear more frequently than expected". That's why with gensim you only get a few bigrams such as 'response time', 'Graph minors' and 'A survey'.
If you look at bigram.vocab you'll see that these bigrams appear twice, whereas all other bigrams appear only once.
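For instance, here is a minimal way to inspect those raw counts, assuming the bigram model built above (depending on your gensim version, the keys of bigram.vocab may be bytes or str):
# Print every candidate bigram Phrases has counted, with its frequency.
for key, count in bigram.vocab.items():
    name = key.decode() if isinstance(key, bytes) else key
    if "_" in name:
        print(name, count)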
scikit-learn's CountVectorizer class, on the other hand, gives you all bigrams.
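If the goal really is to reproduce CountVectorizer's exhaustive ngram_range=(1, 2) behaviour on the gensim side, one option is to skip the statistical filtering entirely and enumerate every adjacent pair yourself. Below is a minimal sketch; all_unigrams_and_bigrams is a hypothetical helper applied to the original tokenized documents (before the Phrases step), and the resulting vocabulary size will still differ slightly from scikit-learn's because its token_pattern drops one-character tokens:
from gensim import corpora

def all_unigrams_and_bigrams(tokens):
    # Lowercase, then emit every unigram plus every adjacent bigram,
    # mirroring what ngram_range=(1, 2) does in CountVectorizer.
    tokens = [t.lower() for t in tokens]
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

expanded = [all_unigrams_and_bigrams(doc) for doc in documents]
dictionary = corpora.Dictionary(expanded)
print(len(dictionary))      # number of unique unigrams + bigrams
print(dictionary.token2id)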
