Efficiently replace numbers in list of strings with a token


I have lists of strings in which some of the strings represent integers. I'd like a fast way to replace numbers over 100 with a token based on the number of digits.

    ['foo', 'bar', '3333'] -> ['foo', 'bar', '99994']

I will be performing this operation millions of times on lists of around 100 items. The pure Python method I've come up with is as follows:

    def quash_large_numbers(tokens, threshold=100):
        def is_int(s):
            try:
                int(s)
                return True
            except ValueError:
                return False

        BIG_NUMBER_TOKEN = '9999%d'
        tokens_no_high_nums = [BIG_NUMBER_TOKEN % len(t) if is_int(t) and int(t) > threshold else t
                               for t in tokens]
        return tokens_no_high_nums

I tried to see whether I could do this more quickly with pandas, but it's much slower for small lists, presumably because of the overhead of converting back and forth between a list and a Series.

    import pandas as pd

    def pd_quash_large_numbers(tokens, threshold=100):
        BIG_NUMBER_TOKEN = '9999'  # a string, so it can be concatenated with the lengths
        tokens_ser = pd.Series(tokens)
        # non-numeric strings become NaN, which never compares greater than the threshold
        int_tokens = pd.to_numeric(tokens_ser, errors='coerce')
        tokens_over_threshold = int_tokens > threshold
        str_lengths = tokens_ser[tokens_over_threshold].str.len().astype(str)
        tokens_ser[tokens_over_threshold] = BIG_NUMBER_TOKEN + str_lengths
        return tokens_ser.tolist()

Is there a more efficient way that I'm missing here? Possibly via Cython?
I got a nice speedup by counting the digits in the text instead of doing any conversion: in the test program below, that cut the running time by nearly 80%. The program runs the original code, my textual-inspection code, and piRSquared's numpy code. Let the best code win!

    import time

    # a thousand lists of 99 items each (close to the 100-item lists in the question)
    test_data = [['foo', 'bar', '3333'] * 33 for _ in range(1000)]

    def quash_large_numbers(tokens, threshold=100):
        def is_int(s):
            try:
                int(s)
                return True
            except ValueError:
                return False

        BIG_NUMBER_TOKEN = '9999%d'
        tokens_no_high_nums = [BIG_NUMBER_TOKEN % len(t) if is_int(t) and int(t) > threshold else t
                               for t in tokens]
        return tokens_no_high_nums

    start = time.time()
    result = [quash_large_numbers(tokens, 100) for tokens in test_data]
    print('original', time.time() - start)


    def quash(somelist, digits):
        # keep short or non-digit strings; replace longer digit strings with '9999<length>'
        return [text if len(text) <= digits or not text.isdigit() else '9999' + str(len(text))
                for text in somelist]

    start = time.time()
    result = [quash(item, 2) for item in test_data]
    print('textual ', time.time() - start)


    import numpy as np

    def np_quash(somelist, threshold=100):
        v = np.array(somelist)
        r = np.arange(v.size)
        # np.char is the public alias of np.core.defchararray
        m = np.char.isdigit(v)
        g = v[m].astype(int) > threshold
        i = r[m][g]
        t = np.array(['9999{}'.format(len(x)) for x in v[i].tolist()])
        v = v.astype(t.dtype)
        v[i] = t
        return v.tolist()

    start = time.time()
    result = [np_quash(item, 100) for item in test_data]
    print('numpy   ', time.time() - start)

Results (seconds)

    original 0.6143333911895752
    textual  0.12842845916748047
    numpy    0.3644399642944336
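
One caveat: counting digits is not quite the same test as comparing against the threshold. With `digits=2`, `quash` also replaces '100' (which is not over 100) and zero-padded strings such as '007'. If the exact threshold matters, a minimal sketch along the following lines uses the length only as a cheap pre-filter and converts just the remaining candidates. The name `quash_exact` is made up for illustration, the sketch assumes a non-negative threshold and plain unsigned digit strings (no '+', '-', or surrounding whitespace, all of which `int()` would accept), and it has not been timed against the versions above.

```python
def quash(somelist, digits):
    # Digit-count version from the test program above.
    return [text if len(text) <= digits or not text.isdigit()
            else '9999' + str(len(text)) for text in somelist]


def quash_exact(tokens, threshold=100):
    # Hypothetical variant: assumes threshold >= 0 and unsigned digit strings.
    # A string with fewer digits than the threshold cannot exceed it,
    # so the length check skips most tokens before any int() conversion.
    # isdecimal() is used instead of isdigit() because isdigit() is also
    # True for characters such as a superscript two, which int() rejects.
    min_len = len(str(threshold))
    return [
        '9999%d' % len(t)
        if len(t) >= min_len and t.isdecimal() and int(t) > threshold
        else t
        for t in tokens
    ]


sample = ['foo', '99', '100', '007', '101', '3333']
print(quash(sample, 2))      # ['foo', '99', '99993', '99993', '99993', '99994']
print(quash_exact(sample))   # ['foo', '99', '100', '007', '99993', '99994']
```

If every number in the data is free of leading zeros and the threshold is one less than a power of ten, the two functions agree and the plain digit count is the simpler choice.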
New, quicker answer

    import numpy as np

    v = np.array(['foo', 'bar', '3333'])
    r = np.arange(v.size)
    # np.char is the public alias of np.core.defchararray
    m = np.char.isdigit(v)
    g = v[m].astype(int) > 100
    i = r[m][g]
    t = np.array(['9999{}'.format(len(x)) for x in v[i].tolist()])
    v = v.astype(t.dtype)
    v[i] = t
    v.tolist()

    ['foo', 'bar', '99994']

Old answer

    s = pd.Series(['foo', 'bar', '3333'])
    s.loc[pd.to_numeric(s, errors='coerce') > 100] = s.str.len().map('9999{}'.format)
    s

    0      foo
    1      bar
    2    99994
    dtype: object

Or

    s.tolist()

    ['foo', 'bar', '99994']
