python


Why does the file_object.tell() gives the same byte for a file at different positions?


Just starting my way into python and I can't get around the basic file navigation methods.
When I read the tell() tutorial it states that it returns the position where I am currently sitting on my file (on bytes).
My reasoning is that each character of the file will add up to the byte coordinate, right? This would mean that after a new line, which is just a string of characters that is split on the \n character, my byte coordinate would change ... but that seems to be incorrect.
I generate a quick toy text file on bash
$ for i in {1..10}; do echo "# this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt
and now I will iterate through this file and print out the line number and on each cycle, the result of the tell() call. The # are there as to mark some lines that delimit blocks of the file, which I want to return (see below).
My guess is that the for loop is iterating over the file object first, reaching it's end and thus it remains always the same.
This is toy example, on my real problem the file is Gigs in length and by applying the same method I get the result of tell() in blocks of what I image are reflecting how the for loop iterated over the file object.
Is this correct? Could you please shed some light on the concepts I am missing?
My final goal is to be able to locate specific coordinates in a file and then in parallel process these huge files from distributed starting points which I cannot monitor in the way I am screening for them.
os.path.getsize("toy.txt")
451
fa = open("toy.txt")
fa.seek(0) # let's double check
fa.tell()
count = 0
for line in fa:
if line.startswith("#"):
print line ,
print "tell {} count {}".format(fa.tell(), count)
else:
if count < 32775:
print line,
print "tell {} count {}".format(fa.tell(), count)
count += 1
Output:
# this is the 1th line
tell 451 count 0
# this is the 2th line
tell 451 count 1
# this is the 3th line
tell 451 count 2
# this is the 4th line
tell 451 count 3
# this is the 5th line
tell 451 count 4
# this is the 6th line
tell 451 count 5
# this is the 7th line
tell 451 count 6
# this is the 8th line
tell 451 count 7
# this is the 9th line
tell 451 count 8
# this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19
You are using a for loop to read the file line by line:
for line in fa:
Files don't normally do this; you read blobs of data, usually chunks. In order for Python to give you lines instead, you need to read until the next newline. Only, reading byte by byte to find newlines is not very efficient.
So a buffer is used; you read a large chunk, then find the newlines in that chunk and yield a line for each one you find. Once the buffer is exhausted, you read a new chunk.
Your file in not big enough to read more than one chunk; it is only 451 bytes small, while a buffer is usually measured in kilobytes. If you were to create a larger file, you'll see the file position jump in large steps as you iterate.
See the file.next documenation (next is the method responsible for producing the next line when iterating, what the for loop does):
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.
If you need to keep track of the absolute file position while looping over the lines, you'll have to use binary mode if on Windows (to prevent newline translation taking place), and keep track of the line lengths yourself:
position = 0
for line in fa:
position += len(line)
The alternative is to use the io library; this is the framework used in Python 3 to handle files. The file.tell() method takes the buffer into account and will produce an accurate file position even when iterating.
Take into account that when you use io.open() to open a file in text mode that you'll get unicode strings. In Python 2, you could just use binary mode (open with 'rb'), if you must have str bytestrings. In fact, only in binary mode will you be given access to IOBase.tell(), in textmode an exception is thrown:
>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'# this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
In binary mode, you get accurate output for file.tell():
>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> for line in fa:
... if line.startswith("#"):
... print line ,
... print "tell {} count {}".format(fa.tell(), count)
... else:
... if count < 32775:
... print line,
... print "tell {} count {}".format(fa.tell(), count)
... count += 1
...
# this is the 1th line
tell 23 count 0
# this is the 2th line
tell 46 count 1
# this is the 3th line
tell 69 count 2
# this is the 4th line
tell 92 count 3
# this is the 5th line
tell 115 count 4
# this is the 6th line
tell 138 count 5
# this is the 7th line
tell 161 count 6
# this is the 8th line
tell 184 count 7
# this is the 9th line
tell 207 count 8
# this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19
When you iterate over the file, it uses an internal buffer to minimize expensive IO operations, so the file isn't necessarily positioned at the last character the loop saw.

Related Links

Python HTTPS Login to account to scrape data, is this bad practice?
Celery, redis and ConnectionPool
Numpy Choose Elements from 2 arrays
Selenium Python iterate over a table of rows it is stopping at the first row
How to add a variable of a method of a class in aonther program in python?
Django : Getting error while removing django.contrib.sites from INSTALLED_APPS
Can I scrape a html page which is local machine using scrapy?
how to get request HTTP headers in soaplib views file?
Hive transform query runs slow with ORC file
What if I *really* need to escape quotes for an SQL script?
how to get the different IDs for different webdriver tabs
Running docker in subprocess on Python PyCharm
Django- NoReverseMatch at /login/here/
Threading class running one long function to totally stop if required
scipy does not load currectly in pycharm
Forming Django queryset for one-to-many relation

Categories

HOME
jdbc
oauth
amazon-swf
qpython3
command
transparent
mainframe
laravel-4
dataframe
codeblocks
ssr
game-maker-studio-1.4
wysiwyg
dryioc
x-frame-options
line-api
sendkeys
bar-chart
marathon
deb
embedly
countif
onsen-ui
source-maps
badge
non-deterministic
visual-c++-2017
ui5
forum
websauna
dhtmlx-scheduler
pep8-assembly
getjson
shapes
nsurlconnection
node-horseman
ios-ui-automation
jquery-cycle2
shibboleth
active-model-serializers
google-account
fqdn
paas
jszip
teiid
autoresize
libtiff.net
debugdiag
glkit
radians
nofollow
boost-hana
spatial-query
achievements
squirrel
mongo-c-driver
amazon-kcl
yoothemes
cmocka
portfolio
license-key
android-viewholder
data-import
eyeql
js-cookie
visual-studio-code
webhdfs
team-build
graphical-logo
hana-xs
hidden-field
mov
author
boost-test
pic24
slick-2.0
roxygen
entity-framework-4.1
justgage
manage.py
sharpmap
qtgui
trailing-slash
lync-server-2010
cosm
legacy-code
responsetext
wchar
electronic-signature
nagle
google-instant
plinq
longjmp
winsnmp
dirty-data

Resources

Mobile Apps Dev
Database Users
javascript
java
csharp
php
android
MS Developer
developer works
python
ios
c
html
jquery
RDBMS discuss
Cloud Virtualization
Database Dev&Adm
javascript
java
csharp
php
python
android
jquery
ruby
ios
html
Mobile App
Mobile App
Mobile App