Trying to mimic scikit-learn's ngram_range with gensim


I'm trying to mimic the ngram_range parameter of scikit-learn's CountVectorizer() with gensim. My goal is to be able to run LDA with either scikit-learn or gensim and to obtain very similar bigrams.
For example, with scikit-learn we find bigrams such as "abc computer" and "binary unordered", while with gensim we find "A survey", "Graph minors", ...
I have attached my code below, which compares gensim and scikit-learn in terms of bigrams/unigrams.
Thanks for your help.
documents = [["Human" ,"machine" ,"interface" ,"for" ,"lab", "abc" ,"computer" ,"applications"],
["A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
["The", "EPS", "user", "interface", "management", "system"],
["System", "and", "human", "system", "engineering", "testing", "of", "EPS"],
["Relation", "of", "user", "perceived", "response", "time", "to", "error", "measurement"],
["The", "generation", "of", "random", "binary", "unordered", "trees"],
["The", "intersection", "graph", "of", "paths", "in", "trees"],
["Graph", "minors", "IV", "Widths", "of", "trees", "and", "well", "quasi", "ordering"],
["Graph", "minors", "A", "survey"]]
With the gensim model we find 48 unique tokens; we can print the unigrams/bigrams with print(dictionary.token2id):
# 1. Gensim
from gensim import corpora
from gensim.models import Phrases

# Add bigrams to the docs (only those that appear min_count times or more).
bigram = Phrases(documents, min_count=1)
for idx in range(len(documents)):
    for token in bigram[documents[idx]]:
        if '_' in token:
            # Token is a bigram: add it to the document.
            documents[idx].append(token)

# Replace the '_' delimiter with a space so bigrams read naturally.
documents = [[token.replace("_", " ") for token in doc] for doc in documents]
print(documents)

dictionary = corpora.Dictionary(documents)
print(dictionary.token2id)
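As a quick sanity check on that count (assuming the dictionary just built):

print(len(dictionary))  # 48 unique unigram/bigram tokens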
And with scikit-learn we find 96 unique tokens; we can print scikit-learn's vocabulary with print(vocab):
# 2. Scikit
import re
from sklearn.feature_extraction.text import CountVectorizer

token_pattern = re.compile(r"\b\w\w+\b", re.U)

def custom_tokenizer(s, min_term_length=1):
    """
    Tokenizer that splits text on the token pattern above, keeping only terms
    of at least a certain length which start with an alphabetic character.
    """
    return [x.lower() for x in token_pattern.findall(s)
            if len(x) >= min_term_length and x[0].isalpha()]

def preprocess(docs, min_df=1, min_term_length=1, ngram_range=(1, 1), tokenizer=custom_tokenizer):
    """
    Preprocess a list containing text documents stored as strings.
    docs: list of strings (not tokenized)
    """
    # Build the vector space model of raw term counts in one call.
    vec = CountVectorizer(lowercase=True,
                          strip_accents="unicode",
                          tokenizer=tokenizer,
                          min_df=min_df,
                          ngram_range=ngram_range,
                          stop_words=None)
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
    return (X, vocab)
# Join the token lists back into strings for CountVectorizer
# (note: documents was modified in place by the gensim step above).
docs_join = [' '.join(doc) for doc in documents]

(X, vocab) = preprocess(docs_join, ngram_range=(1, 2))
print(vocab)
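For a direct comparison, a small sketch that diffs the two vocabularies (assuming dictionary and vocab from the two runs above; note that gensim kept the original casing while CountVectorizer lowercases, so we lowercase before intersecting):

# Bigrams from each approach ('_' was already replaced by a space above)
gensim_bigrams = {t.lower() for t in dictionary.token2id if ' ' in t}
sklearn_bigrams = {t for t in vocab if ' ' in t}

print(sorted(gensim_bigrams))                    # the few Phrases-detected bigrams
print(len(sklearn_bigrams))                      # every adjacent pair
print(sorted(gensim_bigrams & sklearn_bigrams))  # bigrams found by both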
Gensim's Phrases class is designed to "Automatically detect common phrases (multiword expressions) from a stream of sentences."
So it only gives you bigrams that "appear more frequently than expected". That's why, with gensim, you only get a few bigrams such as 'response time', 'Graph minors', 'A survey'.
If you look at bigram.vocab you'll see that these bigrams appear twice, whereas all the other bigrams appear only once.
scikit-learn's CountVectorizer class, by contrast, gives you all bigrams.
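If the goal is to make gensim emit every bigram the way CountVectorizer does, one option is to drop the Phrases scoring threshold to its floor. A minimal sketch, assuming documents still holds the original token lists (the npmi scoring and threshold=-1 here are my choices, not something the question requires):

from gensim.models import Phrases

# npmi scores lie in [-1, 1], so threshold=-1 accepts every pair that
# occurs at least min_count times.
bigram_all = Phrases(documents, min_count=1, threshold=-1, scoring='npmi')
print([t for t in bigram_all[documents[0]] if '_' in t])

# Phrases still merges greedily and without overlaps ("survey_of" consumes
# "of", so "of_user" can't also form), so for an exact CountVectorizer-style
# expansion just pair adjacent tokens yourself:
all_bigrams = [[' '.join(pair) for pair in zip(doc, doc[1:])] for doc in documents]
print(all_bigrams[0])

Even with the threshold floored, Phrases consumes tokens as it merges, so the zip approach is the closest match to CountVectorizer's full ngram_range=(1, 2) output.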
