

pandas parse csv with newlines


A while ago I wrapped my data in quote characters on both sides and read it into pandas (see "pandas parse csv with left and right quote chars"). Now I also need to support newlines and some unusual characters inside the fields.
Minimal sample below: the first string (temp) parses just fine, but the second one does not.
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in later pandas releases
temp=u"""<first>$$><$$<second>$$><$$<first>
<foo>$$><$$<bar>$$><$$<baz>"""
temp=u"""<first>$$><$$<second>$$><$$<third>
<foo>$$><$$<bar>$$><$$<baz>
<foo>$$><$$<Green; kkkk 101; aaaa, bbb; [foo<1>>aaa<123>>xxx<1>>zzz<1.17989207 | 18187681 | asdf |>>
;sdf{
}
;ADD{
]>$$><$$<baz>"""
big_df = pd.read_csv(StringIO(temp),
                     encoding='utf8',
                     sep=r'\$\$><\$\$',
                     decimal=',',
                     engine='python')  # we can't use pandas' optimized C parser due to our special delimiters
# strip a trailing "$$>" from the last column
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '', regex=True)
# strip the enclosing "<" and ">" from every value and from the column names
big_df = big_df.replace(['^<', '>$'], ['', ''], regex=True)
big_df.columns = big_df.columns.to_series().replace(['^<', '>$', r'>\$\$'], ['', '', ''], regex=True)
big_df
Edit:
As noted in the comments, it works just fine when each record is put on a single line.
How could I automate this, maybe via sed/awk?
awk '{printf("%s ",$0)} END{print ""}' sample.csv removes all newlines and concatenates everything into a single line. I would rather remove only the problematic newlines.
awk -F, 'NF < 4 {getline nextline; $0 = $0 nextline} 1' sample.csv already removes the ordinary newlines, but the additional blank lines still remain.
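The line-joining idea behind the awk attempt could also be sketched in plain Python. This is only a rough sketch under some assumptions: it counts the `$$><$$` delimiter instead of commas, it assumes every record has exactly three fields, and the function name `merge_physical_lines` and the shortened sample are made up for illustration. It drops the embedded newlines rather than preserving them, and it cannot detect a line break inside the last field of a record.

```python
def merge_physical_lines(text, delimiter="$$><$$", n_fields=3):
    """Join physical lines until a record holds n_fields - 1 delimiters."""
    records = []
    buffer = ""
    for line in text.split("\n"):
        buffer += line  # the embedded newline is dropped, not kept
        if buffer.count(delimiter) >= n_fields - 1:
            # Enough delimiters seen: treat the buffer as one complete record.
            records.append(buffer)
            buffer = ""
    if buffer:
        records.append(buffer)  # incomplete trailing record, kept as-is
    return "\n".join(records)


# Hypothetical, shortened version of the failing sample.
sample = (
    "<first>$$><$$<second>$$><$$<third>\n"
    "<foo>$$><$$<bar>$$><$$<baz>\n"
    "<foo>$$><$$<Green;\n"
    ";sdf{\n"
    "}\n"
    "]>$$><$$<baz>"
)
merged = merge_physical_lines(sample)
```

The merged string has one record per line, so it should then be parseable with the same read_csv() call that already works for the first sample.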

So your "real" newlines are marked with $$>\n. Read your file into a string, replace $$>\n with a temporary placeholder, remove any remaining newlines, put the "real" newlines back in place of the placeholder, then pass the result to read_csv().
temp = temp.replace('$$>\n', '%%NEWLINE%%').replace('\n','').replace('%%NEWLINE%%', '\n')
big_df = pd.read_csv(StringIO(temp), ...)
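To see the effect of the replace chain in isolation, here is a small sketch that wraps it in a function. The function name `join_embedded_newlines`, the placeholder check, and the sample string are assumptions added for illustration. The sample assumes each record really ends with `$$>` directly before its newline, as the stripping code in the question suggests.

```python
def join_embedded_newlines(text, terminator="$$>", placeholder="%%NEWLINE%%"):
    """Keep only the newlines that directly follow the record terminator.

    The terminator itself is consumed together with its newline, exactly
    as in the one-line replace chain above.
    """
    if placeholder in text:
        # Guard (an addition, not part of the one-liner): the placeholder
        # must not collide with real data.
        raise ValueError("placeholder already occurs in the data")
    return (text.replace(terminator + "\n", placeholder)
                .replace("\n", "")
                .replace(placeholder, "\n"))


# Hypothetical sample: every record ends with "$$>", and the second
# record has two newlines embedded in its middle field.
sample = (
    "<first>$$><$$<second>$$><$$<third>$$>\n"
    "<foo>$$><$$<bar\n"
    "{\n"
    "}>$$><$$<baz>$$>\n"
)
cleaned = join_embedded_newlines(sample)
```

Note that the trailing `$$>` of each record disappears along the way, so the step that strips it from the last column becomes a no-op on the cleaned text.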

