python


Getting regular text from a Wikipedia page


I am trying to get the text, or the summary text, from a random Wikipedia page. In the end I need it to be a list of lists of words (a list of sentences).
I am using the following code:
import wikipedia

def get_random_pages_summary(pages=0):
    # Pick `pages` random titles, then fetch the summary of each page.
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

def text_to_list_of_words_without_new_line(text):
    # Replace newlines with spaces, then split on whitespace.
    t = text.replace("\n", " ").strip()
    t1 = t.split()
    t2 = ["".join(w) for w in t1]
    return t2

text = get_random_pages_summary(1)
for i, row in enumerate(text):
    text[i][1] = text_to_list_of_words_without_new_line(row[1])
print(text[0][1])
I am getting weird tokens. I assume they are a relic of the markup code of the Wikipedia page, e.g.
Russian:', u'\u0418\u0432\u0430\u043d
I found that this probably happens when the English page contains a quote in another language. It also happens when the page contains a range of years, e.g. 2015-2016.
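One way to check what these tokens are is to look up the Unicode name of every non-ASCII character. This is a minimal sketch: the helper name describe_non_ascii and the sample string are made up for illustration, and the sample assumes that the year range is written with an en dash (U+2013), which the output above does not confirm.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (code point, Unicode name) pairs for each non-ASCII character."""
    described = []
    for ch in text:
        if ord(ch) > 127:
            # unicodedata.name falls back to "UNKNOWN" for unnamed characters.
            described.append(("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return described

# Hypothetical sample built from the escapes shown above, plus an assumed en dash.
sample = u"Russian: \u0418\u0432\u0430\u043d, 2015\u20132016"
for code, name in describe_non_ascii(sample):
    print("%s %s" % (code, name))
```

If the names come back as Cyrillic letters and a dash, the tokens are ordinary Unicode characters shown in escaped form, not leftover page markup.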
I would like to convert all of these to regular words, and remove those that I cannot convert to regular words.
Thanks.
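A possible approach, sketched under assumptions: treat "regular words" as ASCII-only tokens, map a few typographic characters to ASCII stand-ins, strip accents with Unicode normalization, and drop any token that still contains non-ASCII characters. The names PUNCTUATION_MAP and to_ascii_words are hypothetical, and the set of mapped characters is only a guess at what Wikipedia summaries commonly contain.

```python
import unicodedata

# Assumed mapping of typographic characters to ASCII stand-ins (not exhaustive).
PUNCTUATION_MAP = {
    u"\u2013": u"-",   # en dash, e.g. in year ranges
    u"\u2014": u"-",   # em dash
    u"\u2018": u"'",   # left single quote
    u"\u2019": u"'",   # right single quote
    u"\u201c": u'"',   # left double quote
    u"\u201d": u'"',   # right double quote
}

def to_ascii_words(text):
    """Split text into words, keeping only tokens that can be made pure ASCII."""
    for src, dst in PUNCTUATION_MAP.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    words = []
    for token in decomposed.split():
        # Remove the combining marks, leaving the base letters.
        stripped = u"".join(ch for ch in token if not unicodedata.combining(ch))
        try:
            stripped.encode("ascii")
        except UnicodeEncodeError:
            continue  # token still has non-ASCII characters: drop it
        words.append(stripped)
    return words

sample = u"Ivan (Russian: \u0418\u0432\u0430\u043d) caf\u00e9 2015\u20132016"
print(to_ascii_words(sample))
```

With this sketch, a token that contains Cyrillic letters is dropped together with any punctuation attached to it, while accented Latin words and year ranges are kept in ASCII form. Whether dropping the whole token is the right choice depends on what the word lists are used for.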
