Getting regular text from a Wikipedia page
I am trying to get the text, or the summary text, from a random Wikipedia page. In the end I need it as a list of lists of words (a list of sentences). I am using the following code:

    def get_random_pages_summary(pages = 0):
        import wikipedia
        page_names = [wikipedia.random(1) for i in range(pages)]
        return [[p, wikipedia.page(p).summary] for p in page_names]

    def text_to_list_of_words_without_new_line(text):
        t = text.replace("\n", " ").strip()
        t1 = t.split()
        t2 = ["".join(w) for w in t1]
        return t2

    text = get_random_pages_summary(1)
    for i, row in enumerate(text):
        # row is [page_name, summary]; only the summary is tokenized
        text[i] = text_to_list_of_words_without_new_line(row[1])
    print text

I am getting weird tokens, which I assume are a relic of the markup of the Wikipedia page, e.g.

    Russian:', u'\u0418\u0432\u0430\u043d

I found that this probably happens when the English page quotes text in another language. It also happens when the page contains a range of years, e.g. 2015-2016.

I would like to convert all of these to regular words, and remove those that I cannot convert to regular words.

Thanks.
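The `u'\u0418...'` tokens are most likely not markup residue: Python 2 shows the `repr` of each element when it prints a list, so any non-ASCII word (Cyrillic text, or an en dash in a year range) appears as escape sequences. A minimal sketch of one way to reduce a summary to plain ASCII words follows, written for Python 3 and using only the standard library. The function name `to_ascii_words` and the naive sentence split on `.`, `!` and `?` are assumptions made for illustration, not part of the `wikipedia` package, and the split will mishandle abbreviations such as "e.g.".

```python
import re
import unicodedata


def to_ascii_words(text):
    """Return a list of sentences, each a list of plain ASCII words.

    Hypothetical helper: sentence boundaries are guessed from ., ! and ?.
    """
    cleaned = text.replace("\n", " ").strip()
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    result = []
    for sentence in sentences:
        words = []
        for token in sentence.split():
            # Decompose accented letters into base letter + combining mark,
            # then drop the combining marks ("caf\u00e9" becomes "cafe").
            decomposed = unicodedata.normalize("NFKD", token)
            stripped = "".join(
                c for c in decomposed if not unicodedata.combining(c)
            )
            # Keep only runs of ASCII letters and digits. Anything else,
            # such as Cyrillic letters, punctuation or dashes, is discarded
            # and acts as a separator.
            words.extend(re.findall(r"[A-Za-z0-9]+", stripped))
        if words:
            result.append(words)
    return result


# Sample text with a Cyrillic word, an en dash and an accented letter.
sample = (
    "Ivan (Russian: \u0418\u0432\u0430\u043d) was born in 2015\u20132016. "
    "He visited a caf\u00e9!"
)
print(to_ascii_words(sample))
```

With this approach the Cyrillic word disappears entirely, the year range is split into `2015` and `2016`, and `café` becomes `cafe`. If the foreign words should be kept rather than removed, the alternative is to leave the strings as Unicode and print each word individually instead of printing the whole list.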
Images broadcast over UDP socket (Python)
Index class as list and as dictionary
Python “will the arrow fly straight program”
Processing an eventlog with Pandas - find next match in DataFrame
Intellij python plugin debugging the script copied under the target directory
Create a list with repeated values with list comprehension
Is LASSO regression implemented in Statsmodels?
There is a duplicate line showed when calling a __init__ in Python script
trouble with mousewheel + scrollbars in tkinter
Python Cutting a string on a certain point
Ansible become_user error UnicodeEncodeError: 'ascii' codec can't encode character
Dynamic way to create new columns as a function of existing columns in pandas
polymorphic dispatch: distinguishing Python integers vs. floating-point numbers vs. strings
HTCondor output files: obtain created directory
getting select values with flask [duplicate]
Tensorflow tf.matmul example is incorrect?