python


Getting regular text from a Wikipedia page


I am trying to get the text, or the summary text, from a random Wikipedia page. In the end I need it as a list of lists of words (a list of sentences).
I am using the following code:
    import wikipedia

    def get_random_pages_summary(pages=0):
        # Pick `pages` random titles, then fetch the summary of each page
        page_names = [wikipedia.random(1) for i in range(pages)]
        return [[p, wikipedia.page(p).summary] for p in page_names]

    def text_to_list_of_words_without_new_line(text):
        # Replace newlines with spaces, then split on whitespace
        t = text.replace("\n", " ").strip()
        return t.split()

    text = get_random_pages_summary(1)
    for i, row in enumerate(text):
        text[i][1] = text_to_list_of_words_without_new_line(row[1])
    # Python 2 print statement: printing a list shows the repr of each item
    print text[0][1]
I am getting weird tokens. I assume they are a relic of the markup code for the Wikipedia page, e.g.:
Russian:', u'\u0418\u0432\u0430\u043d
I found that it probably happens when there is a quote from another language inside the English page. It also happens when the page contains a range of years, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove those that I cannot convert to regular words.
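For reference, the `u'\u0418...'` form is how Python 2 displays a unicode string when a list is printed, so the tokens are ordinary non-ASCII text and not markup residue. A year range on Wikipedia is likely typeset with an en dash (U+2013) in place of a hyphen, which would explain the second case. One possible way to do the conversion described above is sketched here. The names `PUNCT_MAP` and `to_ascii_words` are hypothetical, the punctuation table is illustrative and not exhaustive, and the input is assumed to be a unicode string like the one `wikipedia` returns.

```python
import unicodedata

# Illustrative mapping of typographic punctuation to ASCII (an assumption,
# extend it as needed for your own text).
PUNCT_MAP = {
    u"\u2013": u"-",   # en dash, common in year ranges
    u"\u2014": u"-",   # em dash
    u"\u2018": u"'",   # left single quote
    u"\u2019": u"'",   # right single quote
    u"\u201c": u'"',   # left double quote
    u"\u201d": u'"',   # right double quote
}


def to_ascii_words(text):
    """Split text on whitespace and keep only words expressible in ASCII."""
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    # NFKD separates an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    words = []
    for token in decomposed.split():
        # Drop the combining marks, so an accented "e" becomes a plain "e"
        stripped = u"".join(ch for ch in token if not unicodedata.combining(ch))
        try:
            stripped.encode("ascii")
        except UnicodeEncodeError:
            # Still not ASCII (e.g. Cyrillic), so remove the whole word
            continue
        if stripped:
            words.append(stripped)
    return words


sample = u"Ivan (Russian: \u0418\u0432\u0430\u043d) ruled 2015\u20132016 in a caf\u00e9"
words = to_ascii_words(sample)
```

With this sketch the Cyrillic word is dropped, the en dash becomes a hyphen, and the accented word loses its accent. Punctuation attached to a word, such as the opening parenthesis in `(Russian:`, is left in place and would need a separate cleanup step.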
Thanks.

