How to cluster sparse data using Sklearn Kmeans


How do you cluster sparse data using scikit-learn's KMeans implementation?
Attempting to adapt the scikit-learn example for my own use case, I tried:
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

kmeans_data = []
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    freqs = dict((k, v/cnt_sum) for k, v in raw_data.items())
    v = DictVectorizer(sparse=True)
    X = v.fit_transform(freqs)
    kmeans_data.append(X)

kmeans = KMeans(n_clusters=2, random_state=0).fit(kmeans_data)
but this throws the exception:
  File "/myproject/.env/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 854, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "/myproject/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
Presumably I'm not constructing my sparse input matrix X correctly, as it's a list of sparse matrices instead of a sparse matrix containing lists. How do I construct a proper input matrix?
You are building a sparse matrix incrementally. I am not sure whether DictVectorizer can be used incrementally; it is simpler to add the elements to the matrix one by one. See the last example in the scipy.sparse.csr_matrix documentation.
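For reference, the coordinate-style csr_matrix constructor takes three parallel sequences: the non-zero values, their row indices, and their column indices. A minimal sketch with made-up numbers (the values and the shape are illustrative and not taken from the question):

```python
from scipy.sparse import csr_matrix

# Three parallel sequences: entry k is data[k] at position (rows[k], cols[k]).
data = [0.5, 0.5, 1.0]
rows = [0, 0, 1]
cols = [0, 2, 1]

# shape is optional; without it scipy infers the shape from the
# largest row index and the largest column index.
X = csr_matrix((data, (rows, cols)), shape=(2, 3))

dense = X.toarray()
print(dense)
```

Every position that is not listed in the three sequences is an implicit zero, which is what keeps the representation compact.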
Incremental construction
Consider the following double loop:
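from scipy.sparse import csr_matrix

data = []        # non-zero values (relative frequencies)
rows = []        # row index of each value
cols = []        # column index of each value
vocabulary = {}  # maps each word to its column index

for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        f = v / cnt_sum
        # Assign the next free column the first time a word is seen
        i = vocabulary.setdefault(k, len(vocabulary))
        cols.append(i)
        rows.append(index - 1)  # indices in mydata start at 1
        data.append(f)

kmeans_data = csr_matrix((data, (rows, cols)))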
Then kmeans_data is a single sparse matrix, with one row per document, suitable as input to KMeans.
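As a quick check that such a matrix is accepted, the sketch below rebuilds it from the question's sample data and passes it to KMeans. Three details are my own assumptions rather than part of the answer: enumerate is used for the row index instead of index - 1, the shape is passed explicitly so the column count always equals the vocabulary size, and n_init=10 is set explicitly because the default has changed between scikit-learn releases.

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

data, rows, cols = [], [], []
vocabulary = {}  # word -> column index
for row, (index, raw_data) in enumerate(mydata):
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        cols.append(vocabulary.setdefault(k, len(vocabulary)))
        rows.append(row)  # assumption: position in the list, not the stored index
        data.append(v / cnt_sum)

# Assumption: fix the shape explicitly instead of letting scipy infer it.
kmeans_data = csr_matrix((data, (rows, cols)),
                         shape=(len(mydata), len(vocabulary)))

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(kmeans_data)
labels = kmeans.labels_
print(labels)
```

Keeping the vocabulary dictionary around matters if new documents are to be assigned to clusters later, since they must be mapped onto the same columns.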
Direct construction
With DictVectorizer you can construct the count matrix from all the dictionaries in the list of tuples at once, and then use sparse linear algebra routines to normalize the rows.
# 1. Construct the sparse matrix of occurrence counts
D = [d[1] for d in mydata]
v = DictVectorizer(sparse=True)
kmeans_data = v.fit_transform(D)

# 2. Normalize by computing the sum of each row and dividing by it
import numpy as np
from scipy.sparse import csr_matrix

sums = np.asarray(kmeans_data.sum(axis=1)).ravel()
N = len(sums)
# Diagonal matrix holding 1/sum for each row
divisor = csr_matrix((np.reciprocal(sums), (np.arange(N), np.arange(N))))
# Matrix product: scales row i of kmeans_data by 1/sums[i]
kmeans_data = divisor @ kmeans_data
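An alternative to building the diagonal matrix by hand, assuming all counts are non-negative, is scikit-learn's normalize helper with norm='l1', which divides each row by the sum of its absolute values. This swaps in a different routine from the one used above; for non-negative counts it should give the same relative frequencies. The n_init=10 argument is again my own addition.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

# One vectorizer fitted on all dictionaries gives one shared vocabulary.
vectorizer = DictVectorizer(sparse=True)
counts = vectorizer.fit_transform([d for _, d in mydata])

# L1-normalize each row; the result stays sparse.
freqs = normalize(counts, norm='l1', axis=1)

row_sums = np.asarray(freqs.sum(axis=1)).ravel()
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(freqs)
print(row_sums, kmeans.labels_)
```

Fitting a single DictVectorizer on the whole list is also what avoids the original error: KMeans receives one sparse matrix rather than a Python list of one-row matrices.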
