How to cluster sparse data using Sklearn Kmeans


How do you cluster sparse data using Sklearn's Kmeans implementation?
Attempting to adapt their example for my own use case, I tried:
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

kmeans_data = []
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    freqs = dict((k, v/cnt_sum) for k, v in raw_data.items())
    v = DictVectorizer(sparse=True)
    X = v.fit_transform(freqs)
    kmeans_data.append(X)

kmeans = KMeans(n_clusters=2, random_state=0).fit(kmeans_data)
but this throws the exception:
  File "/myproject/.env/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 854, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "/myproject/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
Presumably I'm not constructing my sparse input matrix X correctly, as it's a list of sparse matrices instead of a sparse matrix containing lists. How do I construct a proper input matrix?
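You are building the sparse matrix incrementally, one record at a time, but each loop iteration creates a fresh DictVectorizer, so you end up with a Python list of unrelated one-row matrices rather than a single matrix. I am not sure whether DictVectorizer can be used incrementally. It would be simpler to add the elements to the matrix one by one. See the last example in the scipy.sparse.csr_matrix documentation.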
Incremental construction
Consider the following double loop:
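from scipy.sparse import csr_matrix

data = []        # non-zero values (relative frequencies)
rows = []        # row index of each value
cols = []        # column index of each value
vocabulary = {}  # maps each word to its column index

for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        f = v/cnt_sum
        # Assign the next free column to a word seen for the first time
        i = vocabulary.setdefault(k, len(vocabulary))
        cols.append(i)
        rows.append(index-1)  # ids in mydata are 1-based
        data.append(f)

kmeans_data = csr_matrix((data, (rows, cols)))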
Then kmeans_data is a single sparse matrix with one row per record and one column per word, suitable as input to KMeans.
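To check that the incrementally built matrix is accepted by KMeans, the following minimal sketch repeats the loop on the question's toy data and fits the model. The explicit shape argument and n_init=10 are additions of mine, not part of the original answer: the first pins the matrix dimensions instead of leaving them to be inferred, the second keeps the behaviour the same across scikit-learn versions.

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy data from the question: (1-based id, {word: count})
mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

data, rows, cols = [], [], []
vocabulary = {}
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        cols.append(vocabulary.setdefault(k, len(vocabulary)))
        rows.append(index - 1)
        data.append(v / cnt_sum)

# Assumption: shape is passed explicitly so it does not depend on the
# largest row/column index that happens to appear in the data.
kmeans_data = csr_matrix((data, (rows, cols)),
                         shape=(len(mydata), len(vocabulary)))

# n_init is set explicitly because its default differs between versions.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(kmeans_data)
labels = kmeans.labels_  # one cluster label per row of kmeans_data
print(kmeans_data.shape, labels)
```

Each row of kmeans_data sums to 1, because every count was divided by its record's total before being stored.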
Direct construction
With DictVectorizer you could construct the data matrix from the list of tuples and then use sparse linear algebra routines to perform normalization of rows.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import DictVectorizer

# 1. Construct the sparse matrix of occurrence counts
D = [d[1] for d in mydata]
v = DictVectorizer(sparse=True)
kmeans_data = v.fit_transform(D)

# 2. Normalize by computing the sum of each row and dividing by it
sums = np.asarray(kmeans_data.sum(axis=1)).ravel()
N = len(sums)
# Diagonal matrix holding 1/sum for each row
divisor = csr_matrix((np.reciprocal(sums), (np.arange(N), np.arange(N))))
kmeans_data = divisor * kmeans_data
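As an alternative to building the diagonal divisor matrix by hand, scikit-learn's sklearn.preprocessing.normalize with norm='l1' should give the same row normalization for non-negative counts, since the L1 norm of a row is then just its sum. This is a sketch of a swapped-in technique, not what the original answer used; the variable names counts and freqs are mine.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize

# Toy data from the question: (1-based id, {word: count})
mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

# Fit one vectorizer on the whole list of dicts so all rows share columns.
v = DictVectorizer(sparse=True)
counts = v.fit_transform([d for _, d in mydata])

# Divide each row by the sum of its absolute values (L1 norm).
# For non-negative counts this equals dividing by the row total.
freqs = normalize(counts, norm='l1', axis=1)

print(freqs.shape)
print(freqs.sum(axis=1))
```

The result stays sparse, so it can be passed to KMeans in the same way as the matrix built above.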
