python


Pandas data conversion


I have the following data in a Pandas dataframe:
AIRPORT
EWR|JAX
EWR|BHX
EWR|BHX
EWR|BHX
EWR|BHX
...
Is there a dynamic way to convert this to:
AIRPORT EWR JAX BHX
EWR|JAX Y Y NULL
EWR|BHX Y NULL Y
and so on. I know how to do this for a single hard-coded value by counting its occurrences:
df.assign(EWR = lambda x: x.AIRPORT.apply(lambda y: y.split('|').count('EWR')))
but I'm hoping not to have to write this code for each airport.
You can use the .str accessor with get_dummies, which splits each string on '|' by default and returns one 0/1 column per distinct value. Then use assign with dictionary unpacking to add those columns to your dataframe. Finally, use replace to turn the 0s and 1s into whatever str, bool, or NaN values you prefer.
import numpy as np
# get_dummies builds the 0/1 columns; replace maps 1 -> 'Y' and 0 -> NaN
df_out = df.assign(**df.AIRPORT.str.get_dummies().replace({1:'Y',0:np.nan}))
print(df_out)
Output:
AIRPORT BHX EWR JAX
0 EWR|JAX NaN Y Y
1 EWR|BHX Y Y NaN
2 EWR|BHX Y Y NaN
3 EWR|BHX Y Y NaN
4 EWR|BHX Y Y NaN
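For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the rows shown in the question, and the intermediate names (dummies, flags) are only for illustration. Note that get_dummies returns the new columns in alphabetical order, not in the order the airports first appear.

```python
import numpy as np
import pandas as pd

# Sample data copied from the rows shown in the question.
df = pd.DataFrame(
    {"AIRPORT": ["EWR|JAX", "EWR|BHX", "EWR|BHX", "EWR|BHX", "EWR|BHX"]}
)

# str.get_dummies splits each string on sep (default "|") and returns
# one 0/1 indicator column per distinct token.
dummies = df["AIRPORT"].str.get_dummies(sep="|")

# Map 1 -> "Y" and 0 -> NaN, then add the columns to the original frame.
flags = dummies.replace({1: "Y", 0: np.nan})
df_out = df.assign(**flags)
print(df_out)
```

If the values were separated by something other than a pipe, the sep argument would be the place to change it.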
pandas only with str.get_dummies
dummies = df.AIRPORT.str.get_dummies()
df.join(
    dummies * pd.Series('Y', dummies.columns)
).replace('', np.nan)
AIRPORT BHX EWR JAX
0 EWR|JAX nan Y Y
1 EWR|BHX Y Y nan
2 EWR|BHX Y Y nan
3 EWR|BHX Y Y nan
4 EWR|BHX Y Y nan
pandas & numpy with np.where
dummies = df.AIRPORT.str.get_dummies()
d1 = pd.DataFrame(
    np.where(dummies.values == 1, 'Y', np.nan),
    dummies.index, dummies.columns
)
d2 = df.join(d1)
print(d2)
AIRPORT BHX EWR JAX
0 EWR|JAX nan Y Y
1 EWR|BHX Y Y nan
2 EWR|BHX Y Y nan
3 EWR|BHX Y Y nan
4 EWR|BHX Y Y nan
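One caveat with the np.where version, as far as I can tell: because 'Y' is a string, NumPy (at least in the versions of that era) builds a string array, so the empty cells may end up as the text 'nan' rather than a real missing value, and isna() would not flag them. Below is a hedged sketch of a variant that keeps real NaN by giving np.where object-typed branches. The object arrays are my substitution, not part of the original answer, and the three-row frame is just sample data.

```python
import numpy as np
import pandas as pd

# Small sample frame in the same shape as the question's data.
df = pd.DataFrame({"AIRPORT": ["EWR|JAX", "EWR|BHX", "EWR|BHX"]})
dummies = df["AIRPORT"].str.get_dummies()

# Object-typed branches (an assumption of this sketch) stop NumPy from
# coercing the float NaN into a string.
yes = np.array("Y", dtype=object)
missing = np.array(np.nan, dtype=object)
values = np.where(dummies.to_numpy() == 1, yes, missing)

d1 = pd.DataFrame(values, index=dummies.index, columns=dummies.columns)
d2 = df.join(d1)
print(d2)
print(d2.isna().sum())
```

The trade-off is that object arrays are generally slower than fixed-width string arrays, so this variant may not keep the small speed edge shown in the timings.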
Timing
small data
%%timeit
dummies = df.AIRPORT.str.get_dummies()
df.join(
    dummies * pd.Series('Y', dummies.columns)
).replace('', np.nan)
100 loops, best of 3: 2.31 ms per loop
%timeit df.assign(**df.AIRPORT.str.get_dummies().replace({1:'Y',0:np.nan}))
100 loops, best of 3: 2.78 ms per loop
%%timeit
dummies = df.AIRPORT.str.get_dummies()
d1 = pd.DataFrame(
np.where(dummies.values == 1, 'Y', np.nan),
dummies.index, dummies.columns
)
df.join(d1)
1000 loops, best of 3: 1.65 ms per loop
large data
from string import ascii_uppercase
np.random.seed([3,1415])
source = pd.DataFrame(
np.random.choice(list(ascii_uppercase), [100, 3])
).sum(1).unique()
df = pd.DataFrame(
np.random.choice(source, [10000, 2]), columns=['A', 'B']
).query('A != B').apply('|'.join, 1).to_frame('AIRPORT')
%%timeit
dummies = df.AIRPORT.str.get_dummies()
df.join(
dummies * pd.Series('Y', dummies.columns)
).replace('', np.nan)
1 loop, best of 3: 594 ms per loop
%timeit df.assign(**df.AIRPORT.str.get_dummies().replace({1:'Y',0:np.nan}))
1 loop, best of 3: 629 ms per loop
%%timeit
dummies = df.AIRPORT.str.get_dummies()
d1 = pd.DataFrame(
np.where(dummies.values == 1, 'Y', np.nan),
dummies.index, dummies.columns
)
df.join(d1)
1 loop, best of 3: 592 ms per loop
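On the large frame all three variants land within about 40 ms of each other (592 to 629 ms), which suggests the shared str.get_dummies call dominates the cost. Since %timeit and %%timeit are IPython magics, here is a rough sketch of a comparable measurement with the standard-library timeit module. The helper names (with_replace, with_where_object, best_ms) and the 1,000-row frame are made up for illustration, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 1,000-row frame built by repeating two rows from the question.
df = pd.DataFrame({"AIRPORT": ["EWR|JAX", "EWR|BHX"] * 500})


def with_replace(frame):
    # get_dummies followed by replace, as in the first answer.
    flags = frame["AIRPORT"].str.get_dummies().replace({1: "Y", 0: np.nan})
    return frame.assign(**flags)


def with_where_object(frame):
    # np.where on the raw values; the object-typed branches are an
    # assumption of this sketch so the result holds real NaN values.
    dummies = frame["AIRPORT"].str.get_dummies()
    values = np.where(
        dummies.to_numpy() == 1,
        np.array("Y", dtype=object),
        np.array(np.nan, dtype=object),
    )
    return frame.join(pd.DataFrame(values, dummies.index, dummies.columns))


def best_ms(func, frame, repeat=3, number=10):
    # Best average time per call in milliseconds, similar in spirit to %timeit.
    runs = timeit.repeat(lambda: func(frame), repeat=repeat, number=number)
    return min(runs) / number * 1000


for func in (with_replace, with_where_object):
    print(f"{func.__name__}: {best_ms(func, df):.2f} ms per call")
```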
