python


pandas parse csv with newlines


A while ago I used quotes on both sides of my data and read it into pandas (see "pandas parse csv with left and right quote chars"). Now I also need to support newlines and some weird characters inside the fields.
A minimal sample is below: the first string (temp) works just fine, but the second one does not parse properly.
import pandas as pd
from io import StringIO  # pandas.compat.StringIO no longer exists in recent pandas versions
temp=u"""<first>$$><$$<second>$$><$$<first>
<foo>$$><$$<bar>$$><$$<baz>"""
temp=u"""<first>$$><$$<second>$$><$$<third>
<foo>$$><$$<bar>$$><$$<baz>
<foo>$$><$$<Green; kkkk 101; aaaa, bbb; [foo<1>>aaa<123>>xxx<1>>zzz<1.17989207 | 18187681 | asdf |>>
;sdf{
}
;ADD{
]>$$><$$<baz>"""
big_df = pd.read_csv(StringIO(temp),
                     encoding='utf8',
                     sep=r'\$\$><\$\$',  # raw string so the backslashes reach the regex engine unchanged
                     decimal=',',
                     engine='python')  # we can't use pandas' optimized C parser due to our special delimiters
# strip the trailing "$$>" from the last column (regex=True is required in pandas >= 2.0)
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '', regex=True)
# strip the enclosing "<" and ">" from every value and from the column names
big_df = big_df.replace([r'^<', r'>$'], ['', ''], regex=True)
big_df.columns = big_df.columns.to_series().replace([r'^<', r'>$', r'>\$\$'], ['', '', ''], regex=True)
big_df
Edit
As outlined in the comment, when everything is put onto a single line it works just fine.
How could I automate this, maybe via sed/awk?
awk '{printf("%s ",$0)} END{print ""}' sample.csv removes all newlines and concatenates everything into a single line. I would rather remove only the problematic newlines.
awk -F, 'NF < 4 {getline nextline; $0 = $0 nextline} 1' sample.csv already removes the normal newlines, but the additional blank lines are still there.

So your "real" newlines are marked with $$>\n. Read your file into a string, replace $$>\n with a temporary placeholder, remove any remaining newlines, reinsert the "real" newlines in place of the placeholder, then pass the result to read_csv().
temp = temp.replace('$$>\n', '%%NEWLINE%%').replace('\n','').replace('%%NEWLINE%%', '\n')
big_df = pd.read_csv(StringIO(temp), ...)
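
The three replacements above can be tried out on their own, without pandas. The following is only a sketch under stated assumptions: the sample data is hypothetical (it is not the data from the question) and assumes that every record ends with $$> directly before its newline, the helper name flatten_embedded_newlines is made up for illustration, and the placeholder is assumed never to occur in the real data.

```python
import re

# Hypothetical sample (not from the question): every record ends with "$$>"
# directly before its newline, and one field contains embedded newlines.
raw = (
    "<first>$$><$$<second>$$><$$<third>$$>\n"
    "<foo>$$><$$<bar>$$><$$<baz>$$>\n"
    "<foo>$$><$$<Green; [foo<1>>\n;sdf{\n}\n]>$$><$$<baz>$$>\n"
)

PLACEHOLDER = "%%NEWLINE%%"  # assumed never to occur in the real data


def flatten_embedded_newlines(text, placeholder=PLACEHOLDER):
    """Drop every newline except those that directly follow "$$>"."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the data")
    return (
        text.replace("$$>\n", placeholder)  # protect the record-ending newlines
        .replace("\n", "")                  # remove newlines inside fields
        .replace(placeholder, "\n")         # restore one newline per record
    )


cleaned = flatten_embedded_newlines(raw)

# Split each record with the same regex the question passes as `sep`.
records = [re.split(r"\$\$><\$\$", line) for line in cleaned.splitlines()]
for fields in records:
    print(len(fields), fields)
```

Each record should now split into the same number of fields, so cleaned can be wrapped in StringIO and handed to pd.read_csv with the arguments from the question. Note that the first replacement also consumes the trailing $$> of each record, so stripping it from the last column afterwards only matters for a final record that has no newline after it. Windows line endings (\r\n) are not handled by this sketch.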

