python


Getting regular text from a Wikipedia page


I am trying to get the text, or the summary text, of a random Wikipedia page. In the end I need it as a list of lists of words (that is, a list of sentences).
I am using the following code:
import wikipedia

def get_random_pages_summary(pages=0):
    # Pick `pages` random article titles, then fetch each article's summary.
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

def text_to_list_of_words_without_new_line(text):
    # Replace newlines with spaces, then split on whitespace.
    t = text.replace("\n", " ").strip()
    return t.split()

text = get_random_pages_summary(1)
for i, row in enumerate(text):
    text[i][1] = text_to_list_of_words_without_new_line(row[1])
# Python 2 print statement: prints the word list of the first page.
print text[0][1]
I am getting weird tokens; I assume they are a relic of the markup code of the Wikipedia page, e.g.:
Russian:', u'\u0418\u0432\u0430\u043d
I found that this probably happens when the English page contains a quote in another language. It also happens when the page contains a range of years, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove those that I cannot convert.
Thanks.
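
One possible approach, offered as a minimal sketch rather than a definitive answer. The tokens are most likely not leftover markup: Python 2 prints a list using the repr of each element, so every non-ASCII character shows up as a \u escape. \u0418\u0432\u0430\u043d is the Cyrillic name "Иван", and year ranges on Wikipedia are typically typeset with an en dash (U+2013) instead of a hyphen. The sketch below is written for Python 3 and uses only the standard library. The names `to_plain_word` and `summary_to_sentences` are hypothetical helpers, not part of the wikipedia package, and the sentence splitter is a naive regular expression that will mis-split abbreviations such as "e.g.".

```python
import re
import unicodedata

# Map typographic dashes and curly quotes to plain ASCII equivalents.
_REPLACEMENTS = {
    ord(ch): "-" for ch in "\u2010\u2011\u2012\u2013\u2014\u2212"
}
_REPLACEMENTS.update({
    0x2018: "'", 0x2019: "'",   # single curly quotes
    0x201C: '"', 0x201D: '"',   # double curly quotes
})

# Punctuation stripped from both ends of a token (hyphens are kept).
_PUNCT = ".,;:!?()[]{}\"'"


def to_plain_word(token):
    """Return an ASCII version of token, or None if it has none."""
    token = token.translate(_REPLACEMENTS)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop the combining marks, so that "é" becomes "e".
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    base = base.strip(_PUNCT)
    # Reject empty tokens and tokens that still contain non-ASCII characters.
    if not base or any(ord(ch) > 127 for ch in base):
        return None
    return base


def summary_to_sentences(text):
    """Split text into sentences, each a list of plain ASCII words."""
    # Naive split: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        words = [to_plain_word(tok) for tok in sentence.split()]
        words = [w for w in words if w]
        if words:
            result.append(words)
    return result


# Made-up sample text, not a real Wikipedia summary.
sample = ("Ivan (Russian: \u0418\u0432\u0430\u043d) reigned 2015\u20132016.\n"
          "He liked caf\u00e9s.")
print(summary_to_sentences(sample))
# [['Ivan', 'Russian', 'reigned', '2015-2016'], ['He', 'liked', 'cafes']]
```

To apply this to the code in the question, pass each page's summary string to `summary_to_sentences` in place of `text_to_list_of_words_without_new_line`. Dropping non-Latin words entirely is a design choice; transliterating them would need a third-party library.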
