python


Getting regular text from a Wikipedia page


I am trying to get the text, or the summary text, of a random Wikipedia page. In the end I need it as a list of lists of words (a list of sentences, each sentence being a list of words).
I am using the following code:
    import wikipedia

    def get_random_pages_summary(pages=0):
        # Pick `pages` random titles, then fetch the summary of each one.
        page_names = [wikipedia.random(1) for i in range(pages)]
        return [[p, wikipedia.page(p).summary] for p in page_names]

    def text_to_list_of_words_without_new_line(text):
        # Replace newlines with spaces, then split on whitespace.
        t = text.replace("\n", " ").strip()
        t1 = t.split()
        # Note: "".join(w) on a single word returns the same word (a no-op).
        t2 = ["".join(w) for w in t1]
        return t2

    text = get_random_pages_summary(1)
    for i, row in enumerate(text):
        text[i][1] = text_to_list_of_words_without_new_line(row[1])
    print(text[0][1])
I am getting weird tokens. I assume they are a relic of the markup of the Wikipedia page, e.g.:
Russian:', u'\u0418\u0432\u0430\u043d
It seems to happen when the English page contains a quote in another language. It also happens when the page contains a range of years, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove the ones that I cannot convert to regular words.
Thanks.
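
For what it is worth, the tokens shown above look like Python 2's escaped representation of non-ASCII Unicode strings (printing a list shows each element's repr) rather than leftover page markup. Below is a minimal sketch of one way to get the requested list of sentences, assuming that "regular words" means ASCII-only words. The names `to_ascii_word`, `summary_to_sentences`, and `DASHES` are hypothetical helpers, not part of the wikipedia package, and the sentence splitting is a naive regex that will mis-split abbreviations such as "e.g.".

```python
# -*- coding: utf-8 -*-
import re
import string
import unicodedata

# Assumption: these dash-like characters (en dash, em dash, minus sign, ...)
# should become a plain ASCII hyphen, e.g. in year ranges.
DASHES = re.compile(u"[\u2010-\u2015\u2212]")

def to_ascii_word(word):
    """Return an ASCII version of word, or u"" if it cannot be converted."""
    word = DASHES.sub(u"-", word)
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", word)
    plain = u"".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove punctuation attached to either end of the token.
    plain = plain.strip(string.punctuation)
    # Anything still non-ASCII (e.g. Cyrillic) is treated as unconvertible.
    if any(ord(c) > 127 for c in plain):
        return u""
    return plain

def summary_to_sentences(text):
    """Split text into sentences, each a list of ASCII-only words."""
    text = text.replace(u"\n", u" ").strip()
    # Naive split: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    result = []
    for sentence in sentences:
        words = [to_ascii_word(w) for w in sentence.split()]
        words = [w for w in words if w]  # drop unconvertible tokens
        if words:
            result.append(words)
    return result

sample = u"Ivan (Russian: \u0418\u0432\u0430\u043d) reigned 1547\u20131584. He was a tsar."
print(summary_to_sentences(sample))
# Python 3 prints:
# [['Ivan', 'Russian', 'reigned', '1547-1584'], ['He', 'was', 'a', 'tsar']]
```

In the original code, this could be applied to `wikipedia.page(p).summary` in place of `text_to_list_of_words_without_new_line`. Dropping whole tokens that contain non-ASCII letters is a design choice made for this sketch; transliterating them instead would need a separate library.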
