Getting regular text from a Wikipedia page
I am trying to get the text, or the summary text, from a random Wikipedia page. In the end I need it as a list of lists of words (a list of sentences). I am using the following code:

    import wikipedia

    def get_random_pages_summary(pages=0):
        # Pick random page titles, then fetch the summary of each page.
        page_names = [wikipedia.random(1) for i in range(pages)]
        return [[p, wikipedia.page(p).summary] for p in page_names]

    def text_to_list_of_words_without_new_line(text):
        # Replace newlines with spaces, then split on whitespace.
        t = text.replace("\n", " ").strip()
        t1 = t.split()
        t2 = ["".join(w) for w in t1]
        return t2

    text = get_random_pages_summary(1)
    for i, row in enumerate(text):
        # row is [title, summary]; only the summary is split into words.
        text[i] = text_to_list_of_words_without_new_line(row[1])
    print text

I am getting weird tokens, which I assume are a relic of the markup of the Wikipedia page, e.g.

    Russian:', u'\u0418\u0432\u0430\u043d

It seems to happen when the English page quotes text in another language. It also happens when the page contains a range of years, e.g. 2015-2016.

I would like to convert all of these to regular words, and remove those that I cannot convert to regular words.

Thanks.
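One possible approach, sketched under assumptions: the tokens in the output are not leftover markup. `u'\u0418\u0432\u0430\u043d'` is how Python 2 displays a Unicode string (here the Cyrillic word for "Ivan") when a list is printed, and year ranges on Wikipedia are usually written with an en dash (`\u2013`) rather than a plain hyphen. The helpers below, `to_ascii_word` and `to_ascii_words`, and the `PUNCTUATION_MAP` table are hypothetical names introduced for illustration. They map a few typographic characters to ASCII, strip accents, and drop any word that still contains non-ASCII characters.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical, non-exhaustive table of typographic characters and the
# ASCII stand-ins chosen for them in this sketch.
PUNCTUATION_MAP = {
    u"\u2013": u"-",   # en dash, common in year ranges such as 2015-2016
    u"\u2014": u"-",   # em dash
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
}


def to_ascii_word(token):
    """Return an ASCII version of token, or None if it cannot be converted."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop the combining marks so that e.g. "café" becomes "cafe".
    stripped = u"".join(ch for ch in decomposed if not unicodedata.combining(ch))
    try:
        return stripped.encode("ascii").decode("ascii")
    except UnicodeEncodeError:
        # Something non-ASCII is left (e.g. Cyrillic letters): give up.
        return None


def to_ascii_words(text):
    """Split text on whitespace and keep only the words convertible to ASCII."""
    for src, dst in PUNCTUATION_MAP.items():
        text = text.replace(src, dst)
    converted = (to_ascii_word(token) for token in text.split())
    return [word for word in converted if word]


sample = u"Ivan (Russian: \u0418\u0432\u0430\u043d) played caf\u00e9 shows in 2015\u20132016."
print(to_ascii_words(sample))
```

In this sketch a word is dropped entirely when any of its characters cannot be converted, so the Cyrillic token disappears together with the closing parenthesis attached to it. If only the offending characters should be removed, `encode("ascii", "ignore")` could be used instead of the `try`/`except`.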
What is the MATLAB equivalent of a named tuple in Python?
Differences of scipy.spatial.KDTree in python 2.7 and 3.5
Getting correct exogenous least squares prediction in Python statsmodels
How can I merge two beautiful soup tags?
Python String Query
How to create and use multiple pipes within the same process with pexpect?
Python Error 24: too many files open: Per Process Limit?
How to plot a Gaussian function on Python?
Running a script for many files of the same extension. Getting 'UnboundLocalError'
Connecting to IBM AS400 server for database operations hangs
scikit learn: polynomial interpolation of higher dimensions
Centralized Django Installations with VirtualEnv
Matplotlib - Draw points that satisfy condition
Lektor Pagination - TemplateSyntaxError: Encountered unknown tag 'endblock'
Can anyone help me out with the ASCII part please
Compare two databases for any differences