python
pandas parse csv with newlines
A while ago I used quotes on both sides of my data and read it into pandas (see "pandas parse csv with left and right quote chars"). Now I also need to support newlines and some weird characters inside the fields.

Minimal sample below: the first string (temp) works just fine, but the second one won't parse properly.

    import pandas as pd
    from io import StringIO  # pandas.compat.StringIO was removed in newer pandas

    # First sample: parses fine.
    temp = u"""<first>$$><$$<second>$$><$$<first>
    <foo>$$><$$<bar>$$><$$<baz>"""

    # Second sample: the last row has newlines inside a field and does not parse.
    temp = u"""<first>$$><$$<second>$$><$$<third>
    <foo>$$><$$<bar>$$><$$<baz>
    <foo>$$><$$<Green; kkkk 101; aaaa, bbb; [foo<1>>aaa<123>>xxx<1>>zzz<1.17989207 | 18187681 | asdf |>>
    ;sdf{
    }
    ;ADD{
    ]>$$><$$<baz>"""

    # We can't use pandas' optimized C parser due to our special delimiters.
    big_df = pd.read_csv(StringIO(temp), encoding='utf8', sep=r'\$\$><\$\$', decimal=',', engine='python')

    # Strip the trailing row marker from the last column, then the < > wrappers.
    big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '', regex=True)
    big_df = big_df.replace(['^<', '>$'], ['', ''], regex=True)
    big_df.columns = big_df.columns.to_series().replace(['^<', '>$', r'>\$\$'], ['', '', ''], regex=True)
    big_df

Edit: As outlined in the comment, it works just fine when everything is put onto a single line. How could I automate this, maybe via sed/awk?

    awk '{printf("%s ",$0)} END{print ""}' sample.csv

removes all newlines and concatenates everything into a single line. I would rather remove only the problematic newlines.

    awk -F, 'NF < 4 {getline nextline; $0 = $0 nextline} 1' sample.csv

already removes the normal newlines, but the additional blank lines are still there.
So your "real" newlines are marked with $$>\n. Read your file into a string, replace $$>\n with a temporary placeholder, remove any remaining newlines, reinsert the "real" newlines, then pass the result to read_csv():

    temp = temp.replace('$$>\n', '%%NEWLINE%%').replace('\n', '').replace('%%NEWLINE%%', '\n')
    big_df = pd.read_csv(StringIO(temp), ...)
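For illustration, here is a minimal end-to-end sketch of that placeholder approach. The sample data, the helper names collapse_embedded_newlines and strip_markers, and the assumption that every record ends with $$> directly before its newline are hypothetical and not taken from the question's real file. It also assumes the %%NEWLINE%% placeholder never occurs in the data.

```python
import re
from io import StringIO

import pandas as pd

PLACEHOLDER = "%%NEWLINE%%"  # assumption: this token never appears in the data


def collapse_embedded_newlines(raw: str) -> str:
    """Keep only the newlines that directly follow the row terminator '$$>'."""
    protected = raw.replace("$$>\n", PLACEHOLDER)  # shield the real row breaks
    flattened = protected.replace("\n", "")        # drop every other newline
    return flattened.replace(PLACEHOLDER, "\n")    # restore the real row breaks


def strip_markers(text: str) -> str:
    """Drop the leading '<' and the trailing '>' (or '>$$>') from one cell."""
    return re.sub(r"^<|>(?:\$\$>)?$", "", text)


# Hypothetical sample: each record ends with '$$>', and the last record has
# line breaks (including a blank line) inside its middle field.
raw = (
    "<first>$$><$$<second>$$><$$<third>$$>\n"
    "<foo>$$><$$<bar>$$><$$<baz>$$>\n"
    "<foo>$$><$$<Green;\n\n;sdf{\n}>$$><$$<baz>$$>\n"
)

cleaned = collapse_embedded_newlines(raw)

# The regex separator requires the slower Python parsing engine.
df = pd.read_csv(StringIO(cleaned), sep=r"\$\$><\$\$", engine="python")

# Remove the wrapper characters from the header and from every cell.
df.columns = [strip_markers(name) for name in df.columns]
df = df.apply(lambda col: col.map(strip_markers))
print(df)
```

Note that this deletes the embedded newlines rather than preserving them; if the line breaks inside a field matter, replacing them with a space or another marker instead of the empty string would be one option.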