python


Why does the file_object.tell() gives the same byte for a file at different positions?


Just starting my way into python and I can't get around the basic file navigation methods.
When I read the tell() tutorial it states that it returns the position where I am currently sitting on my file (on bytes).
My reasoning is that each character of the file will add up to the byte coordinate, right? This would mean that after a new line, which is just a string of characters that is split on the \n character, my byte coordinate would change ... but that seems to be incorrect.
I generate a quick toy text file on bash
$ for i in {1..10}; do echo "# this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt
and now I will iterate through this file and print out the line number and on each cycle, the result of the tell() call. The # are there as to mark some lines that delimit blocks of the file, which I want to return (see below).
My guess is that the for loop is iterating over the file object first, reaching it's end and thus it remains always the same.
This is toy example, on my real problem the file is Gigs in length and by applying the same method I get the result of tell() in blocks of what I image are reflecting how the for loop iterated over the file object.
Is this correct? Could you please shed some light on the concepts I am missing?
My final goal is to be able to locate specific coordinates in a file and then in parallel process these huge files from distributed starting points which I cannot monitor in the way I am screening for them.
os.path.getsize("toy.txt")
451
fa = open("toy.txt")
fa.seek(0) # let's double check
fa.tell()
count = 0
for line in fa:
if line.startswith("#"):
print line ,
print "tell {} count {}".format(fa.tell(), count)
else:
if count < 32775:
print line,
print "tell {} count {}".format(fa.tell(), count)
count += 1
Output:
# this is the 1th line
tell 451 count 0
# this is the 2th line
tell 451 count 1
# this is the 3th line
tell 451 count 2
# this is the 4th line
tell 451 count 3
# this is the 5th line
tell 451 count 4
# this is the 6th line
tell 451 count 5
# this is the 7th line
tell 451 count 6
# this is the 8th line
tell 451 count 7
# this is the 9th line
tell 451 count 8
# this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19

You are using a for loop to read the file line by line:
for line in fa:
Files don't normally do this; you read blobs of data, usually chunks. In order for Python to give you lines instead, you need to read until the next newline. Only, reading byte by byte to find newlines is not very efficient.
So a buffer is used; you read a large chunk, then find the newlines in that chunk and yield a line for each one you find. Once the buffer is exhausted, you read a new chunk.
Your file in not big enough to read more than one chunk; it is only 451 bytes small, while a buffer is usually measured in kilobytes. If you were to create a larger file, you'll see the file position jump in large steps as you iterate.
See the file.next documenation (next is the method responsible for producing the next line when iterating, what the for loop does):
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.
If you need to keep track of the absolute file position while looping over the lines, you'll have to use binary mode if on Windows (to prevent newline translation taking place), and keep track of the line lengths yourself:
position = 0
for line in fa:
position += len(line)
The alternative is to use the io library; this is the framework used in Python 3 to handle files. The file.tell() method takes the buffer into account and will produce an accurate file position even when iterating.
Take into account that when you use io.open() to open a file in text mode that you'll get unicode strings. In Python 2, you could just use binary mode (open with 'rb'), if you must have str bytestrings. In fact, only in binary mode will you be given access to IOBase.tell(), in textmode an exception is thrown:
>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'# this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
In binary mode, you get accurate output for file.tell():
>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> for line in fa:
... if line.startswith("#"):
... print line ,
... print "tell {} count {}".format(fa.tell(), count)
... else:
... if count < 32775:
... print line,
... print "tell {} count {}".format(fa.tell(), count)
... count += 1
...
# this is the 1th line
tell 23 count 0
# this is the 2th line
tell 46 count 1
# this is the 3th line
tell 69 count 2
# this is the 4th line
tell 92 count 3
# this is the 5th line
tell 115 count 4
# this is the 6th line
tell 138 count 5
# this is the 7th line
tell 161 count 6
# this is the 8th line
tell 184 count 7
# this is the 9th line
tell 207 count 8
# this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19

When you iterate over the file, it uses an internal buffer to minimize expensive IO operations, so the file isn't necessarily positioned at the last character the loop saw.


Related Links

Printing out a word if it has a specific letter in it? [closed]
Django-cities exits with “killed”
lookahead algorithm in Python
flask + gunicorn, make a redirect stick to the same worker thread
What is the difference between t_ignore, pass and t.lexer.skip() in ply.lex?
how do i pass groupby elements as lists to a function in python/pandas
Numpy argwhere inequality conditions
How CMD.exe recognize an extension need program?
How to edit data with django-tables2 in a frontend?
While loop giving different return and print results?
Trouble understanding a strange Python import
Reading the binary file shows unreadable characters
Dynamically re-sizing images in a GStreamer pipeline in python
efficient expanding linregress
Passing python dictionary to a class changes the value of dictionary [duplicate]
Convert list into correct input for polyfit [closed]

Categories

HOME
symfony
jar
drupal-7
label
visualization
docker-swarm
transparent
bibtex
search-engine
is-empty
vlc
recyclerview
ext.net
nodemailer
deeplearning4j
google-plus
rfid
cortex-a
game-physics
phoenix
squarespace
react-leaflet
eclipse-luna
lenskit
ejs
statusbar
jquery-scrollify
jquery-waypoints
tfs2013
codelite
pentaho-report-designer
system.data.sqlite
pygooglechart
backup-strategies
recurrence-relation
google-guava-cache
color-scheme
boilerplate
jboss-esb
nodeclipse
elfinder
android-preferences
web-development-server
stereo-3d
constants
taglib
3scale
walmart-electrode
seamless-immutable
jenkins-jira-trigger
rhel6
windows-azure-pack
tuxedo
nofollow
nssplitview
group-concat
portfolio
getrusage
android-cursoradapter
vine
calibration
pylearn
jazz
sysfs
cloudpebble
ultrawingrid
dmp
node-imagemagick
dalekjs
dto
code-testing
bsp
map-force
flash-cc
dynamic-proxy
sanitization
sim900
convex-polygon
ruboto
argb
google-closure-library
cgimageref
dynamics-ax-2009
arbor.js
abnf
android-contextmenu
cbcentralmanager
attachevent
vim-powerline
gjs
userid
nsindexpath
serp
noir
squeel
email-spec
libavformat
surefire
scala-2.8
google-instant
plinq





Mobile Apps Dev
Database Users
javascript
java
csharp
php
android
MS Developer
developer works
python
ios
c
html
jquery
RDBMS discuss
Cloud Virtualization
Database Dev&Adm