python


Getting regular text from a Wikipedia page


I am trying to get the text, or the summary text, of a random Wikipedia page. In the end I need it as a list of lists of words (a list of sentences).
I am using the following code:
    import wikipedia  # third-party package

    def get_random_pages_summary(pages=0):
        # Pick `pages` random page titles and pair each with its summary text.
        page_names = [wikipedia.random(1) for i in range(pages)]
        return [[p, wikipedia.page(p).summary] for p in page_names]

    def text_to_list_of_words_without_new_line(text):
        # Replace newlines with spaces, then split on whitespace.
        t = text.replace("\n", " ").strip()
        return t.split()

    text = get_random_pages_summary(1)
    for i, row in enumerate(text):
        text[i][1] = text_to_list_of_words_without_new_line(row[1])
    print(text[0][1])
I am getting weird tokens in the output. I assume they are a relic of the markup of the Wikipedia page, e.g.:
    Russian:', u'\u0418\u0432\u0430\u043d
This seems to happen when the English page contains a quote in another language. It also happens when the page contains a range of years, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove those that I cannot convert.
Thanks.
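
For reference, the escaped token shown above is the Cyrillic word "Иван": Python 2 prints a list using the repr of each element, so non-ASCII characters appear as `\uXXXX` escapes. The year ranges are probably written with an en dash (U+2013) rather than a plain hyphen, though that is an assumption about the page text. Below is a minimal sketch of one way to get the conversion asked for. The function name `to_plain_words` and the set of punctuation it strips are hypothetical choices, not part of the `wikipedia` package.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata


def to_plain_words(text):
    """Split text into ASCII-only words, dropping tokens that cannot be converted."""
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining accent).
    normalized = unicodedata.normalize("NFKD", text)
    # Treat common Unicode dashes as spaces so a year range becomes two tokens.
    normalized = re.sub(u"[\u2010-\u2015\u2212]", " ", normalized)
    words = []
    for token in normalized.split():
        # Drop characters with no ASCII form (Cyrillic letters, combining marks, ...).
        ascii_token = token.encode("ascii", "ignore").decode("ascii")
        # Strip surrounding punctuation; tokens that end up empty are removed.
        ascii_token = ascii_token.strip(".,;:!?()[]\"'")
        if ascii_token:
            words.append(ascii_token)
    return words
```

In the question's code, this could be called in place of `text_to_list_of_words_without_new_line(row[1])`. Note that it silently discards foreign-language words, which matches the request here but loses information.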

