Pandas: Efficient way to get first row with element that is smaller than a given value


I'm wondering if there's an efficient way to do this in pandas: given a dataframe, what is the first row whose value is smaller than a given value? For example, given:
      addr
0  4196656
1  4197034
2  4197075
3  4197082
4  4197134
What is the first value that is smaller than 4197080, i.e. the largest value below it? I want it to return just the row with 4197075.
One solution is to filter by 4197080 and then take the last row, but that looks like an extremely slow O(N) operation (first building a dataframe and then taking its last row), while a binary search would take O(log N).
df.addr[ df.addr < 4197080].tail(1)
I timed it, and creating df.addr[ df.addr < 4197080] more or less takes the same as df.addr[ df.addr < 4197080].tail(1), strongly hinting that internally it's building an entire df first.
import numpy as np
import pandas as pd
from bisect import bisect
num = np.random.randint(0, 10**8, 10**6)
num.sort()
df = pd.DataFrame({'addr':num})
df = df.set_index('addr', drop=False)
df = df.sort_index()
Getting the first smaller value is very slow:
%timeit df.addr[ df.addr < 57830391].tail(1)
100 loops, best of 3: 7.9 ms per loop
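Using lt improves things a bit (although this only builds the boolean mask and slices its last row, it does not select the matching value):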
%timeit df.lt(57830391)[-1:]
1000 loops, best of 3: 853 µs per loop
But still nowhere near as fast as a binary search:
%timeit bisect(num, 57830391, 0, len(num))
100000 loops, best of 3: 6.53 µs per loop
Is there any better way?
This requires pandas 0.14.0 (for Series.nlargest).
Note that the frame IS NOT SORTED.
In [16]: s = df['addr']
Find the biggest value lower than the required one:
In [18]: %timeit s[s<5783091]
100 loops, best of 3: 9.01 ms per loop
In [19]: %timeit s[s<5783091].nlargest(1)
100 loops, best of 3: 11 ms per loop
So this is faster than actually performing a full sort and then indexing.
The .copy is there so that the in-place sort does not bias the timing.
In [32]: x = np.random.randint(0, 10**8, 10**6)
In [33]: def f(x):
....: x.copy().sort()
....:
In [35]: %timeit f(x)
10 loops, best of 3: 67.2 ms per loop
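For the unsorted case, the filter-then-nlargest approach above could be wrapped up roughly as follows. This is only a sketch: the helper name largest_below and the small sample series are made up for illustration, and it still costs O(N) because of the boolean filter.

```python
import pandas as pd

def largest_below(s, limit):
    """Hypothetical helper: one-row Series with the largest value strictly below limit."""
    below = s[s < limit]        # O(N) boolean filter; does not need sorted data
    if below.empty:
        return below            # nothing is smaller than limit
    return below.nlargest(1)    # selects the top value without a full sort

# Deliberately unsorted sample data (the values from the question, shuffled)
s = pd.Series([4197082, 4196656, 4197134, 4197075, 4197034], name="addr")
result = largest_below(s, 4197080)
print(result)
```

The original index label is preserved in the result, so it can be used to look up the full row in the frame.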
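If you are simply searching an ALREADY SORTED series, then use searchsorted. Note that you must use the numpy version (e.g. operate on .values; the Series version will be defined in 0.14.1). searchsorted returns the insertion position, so the largest smaller value sits one position before it.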
In [41]: %timeit s.values.searchsorted(5783091)
100000 loops, best of 3: 2.5 µs per loop
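To turn the position returned by searchsorted into the actual row, something like the following might work. It is a minimal sketch that assumes the column is already sorted in ascending order; the helper name last_row_below is invented for this example.

```python
import pandas as pd

def last_row_below(df, column, limit):
    """Hypothetical helper: last row of a frame sorted on `column` whose value is < limit."""
    values = df[column].values
    # side="left" gives the first position whose value is >= limit,
    # so everything before that position is strictly smaller
    pos = values.searchsorted(limit, side="left")
    if pos == 0:
        return df.iloc[0:0]         # no value is smaller than limit
    return df.iloc[pos - 1:pos]     # one-row frame, found in O(log N)

df = pd.DataFrame({"addr": [4196656, 4197034, 4197075, 4197082, 4197134]})
row = last_row_below(df, "addr", 4197080)
print(row)
```

If the column is not sorted, searchsorted silently returns a meaningless position, so this only applies to the sorted case.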
