

pandas parse csv with newlines


A while ago I put quote characters on both sides of my data and read it into pandas (see "pandas parse csv with left and right quote chars"). Now I also need to support newlines and some unusual characters inside the fields.
A minimal sample is below. The first string (temp) parses just fine, but the second one does not parse properly.
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from newer pandas releases
# Sample 1: parses fine
temp=u"""<first>$$><$$<second>$$><$$<first>
<foo>$$><$$<bar>$$><$$<baz>"""
# Sample 2: the embedded newlines in the last row break the parse
temp=u"""<first>$$><$$<second>$$><$$<third>
<foo>$$><$$<bar>$$><$$<baz>
<foo>$$><$$<Green; kkkk 101; aaaa, bbb; [foo<1>>aaa<123>>xxx<1>>zzz<1.17989207 | 18187681 | asdf |>>
;sdf{
}
;ADD{
]>$$><$$<baz>"""
big_df = pd.read_csv(StringIO(temp),
                     encoding='utf8',
                     sep=r'\$\$><\$\$',
                     decimal=',',
                     engine='python')  # we can't use pandas' optimized C parser due to our special delimiters
# strip a trailing "$$>" from the last column
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '', regex=True)
# strip the leading "<" and trailing ">" quote characters from values and column names
big_df = big_df.replace(['^<', '>$'], ['', ''], regex=True)
big_df.columns = big_df.columns.to_series().replace(['^<', '>$', r'>\$\$'], ['', '', ''], regex=True)
big_df
Edit
As outlined in the comments, it works just fine when everything is put on a single line.
How could I automate this, perhaps with sed or awk?
awk '{printf("%s ",$0)} END{print ""}' sample.csv removes all newlines and concatenates everything into a single line. I only want to remove the problematic newlines.
awk -F, 'NF < 4 {getline nextline; $0 = $0 nextline} 1' sample.csv already joins the lines split by ordinary newlines, but the additional blank lines are still there.
So your "real" newlines are marked with $$>\n. Read your file into a string, replace $$>\n with a temporary placeholder, remove any remaining newlines, put the "real" newlines back in place of the placeholder, then pass the result to read_csv().
temp = temp.replace('$$>\n', '%%NEWLINE%%').replace('\n','').replace('%%NEWLINE%%', '\n')
big_df = pd.read_csv(StringIO(temp), ...)
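If there is a risk that the placeholder text itself appears in the data, a one-pass variant with a negative lookbehind avoids it. This is a sketch, not a tested drop-in: the function name `keep_only_record_newlines` is hypothetical, and it assumes every real record ends with `$$>` directly before the line break, that line endings are plain "\n", and that no field ends in `$$>` right before an embedded newline.

```python
import re


def keep_only_record_newlines(text):
    """Remove embedded newlines, keeping only those that follow the "$$>" marker."""
    # Drop every newline that is NOT directly preceded by the end-of-record marker.
    joined = re.sub(r"(?<!\$\$>)\n", "", text)
    # Remove the marker itself so that each record sits on its own line.
    cleaned = joined.replace("$$>\n", "\n")
    # The last record may carry the marker without a trailing newline.
    if cleaned.endswith("$$>"):
        cleaned = cleaned[:-3]
    return cleaned


raw = "<a>$$><$$<b>$$>\n<1>$$><$$<x\ny>$$>\n<2>$$><$$<z>$$>"
print(keep_only_record_newlines(raw))
```

The returned string can be wrapped in StringIO and passed to read_csv with the same sep and engine arguments as in the question.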
