python


Why does the file_object.tell() gives the same byte for a file at different positions?


Just starting my way into python and I can't get around the basic file navigation methods.
When I read the tell() tutorial it states that it returns the position where I am currently sitting on my file (on bytes).
My reasoning is that each character of the file will add up to the byte coordinate, right? This would mean that after a new line, which is just a string of characters that is split on the \n character, my byte coordinate would change ... but that seems to be incorrect.
I generate a quick toy text file on bash
$ for i in {1..10}; do echo "# this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt
and now I will iterate through this file and print out the line number and on each cycle, the result of the tell() call. The # are there as to mark some lines that delimit blocks of the file, which I want to return (see below).
My guess is that the for loop is iterating over the file object first, reaching it's end and thus it remains always the same.
This is toy example, on my real problem the file is Gigs in length and by applying the same method I get the result of tell() in blocks of what I image are reflecting how the for loop iterated over the file object.
Is this correct? Could you please shed some light on the concepts I am missing?
My final goal is to be able to locate specific coordinates in a file and then in parallel process these huge files from distributed starting points which I cannot monitor in the way I am screening for them.
os.path.getsize("toy.txt")
451
fa = open("toy.txt")
fa.seek(0) # let's double check
fa.tell()
count = 0
for line in fa:
if line.startswith("#"):
print line ,
print "tell {} count {}".format(fa.tell(), count)
else:
if count < 32775:
print line,
print "tell {} count {}".format(fa.tell(), count)
count += 1
Output:
# this is the 1th line
tell 451 count 0
# this is the 2th line
tell 451 count 1
# this is the 3th line
tell 451 count 2
# this is the 4th line
tell 451 count 3
# this is the 5th line
tell 451 count 4
# this is the 6th line
tell 451 count 5
# this is the 7th line
tell 451 count 6
# this is the 8th line
tell 451 count 7
# this is the 9th line
tell 451 count 8
# this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19
You are using a for loop to read the file line by line:
for line in fa:
Files don't normally do this; you read blobs of data, usually chunks. In order for Python to give you lines instead, you need to read until the next newline. Only, reading byte by byte to find newlines is not very efficient.
So a buffer is used; you read a large chunk, then find the newlines in that chunk and yield a line for each one you find. Once the buffer is exhausted, you read a new chunk.
Your file in not big enough to read more than one chunk; it is only 451 bytes small, while a buffer is usually measured in kilobytes. If you were to create a larger file, you'll see the file position jump in large steps as you iterate.
See the file.next documenation (next is the method responsible for producing the next line when iterating, what the for loop does):
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.
If you need to keep track of the absolute file position while looping over the lines, you'll have to use binary mode if on Windows (to prevent newline translation taking place), and keep track of the line lengths yourself:
position = 0
for line in fa:
position += len(line)
The alternative is to use the io library; this is the framework used in Python 3 to handle files. The file.tell() method takes the buffer into account and will produce an accurate file position even when iterating.
Take into account that when you use io.open() to open a file in text mode that you'll get unicode strings. In Python 2, you could just use binary mode (open with 'rb'), if you must have str bytestrings. In fact, only in binary mode will you be given access to IOBase.tell(), in textmode an exception is thrown:
>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'# this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
In binary mode, you get accurate output for file.tell():
>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> for line in fa:
... if line.startswith("#"):
... print line ,
... print "tell {} count {}".format(fa.tell(), count)
... else:
... if count < 32775:
... print line,
... print "tell {} count {}".format(fa.tell(), count)
... count += 1
...
# this is the 1th line
tell 23 count 0
# this is the 2th line
tell 46 count 1
# this is the 3th line
tell 69 count 2
# this is the 4th line
tell 92 count 3
# this is the 5th line
tell 115 count 4
# this is the 6th line
tell 138 count 5
# this is the 7th line
tell 161 count 6
# this is the 8th line
tell 184 count 7
# this is the 9th line
tell 207 count 8
# this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19
When you iterate over the file, it uses an internal buffer to minimize expensive IO operations, so the file isn't necessarily positioned at the last character the loop saw.

Related Links

Intellij python plugin debugging the script copied under the target directory
Create a list with repeated values with list comprehension
Is LASSO regression implemented in Statsmodels?
There is a duplicate line showed when calling a __init__ in Python script
trouble with mousewheel + scrollbars in tkinter
Python Cutting a string on a certain point
Ansible become_user error UnicodeEncodeError: 'ascii' codec can't encode character
Dynamic way to create new columns as a function of existing columns in pandas
polymorphic dispatch: distinguishing Python integers vs. floating-point numbers vs. strings
HTCondor output files: obtain created directory
getting select values with flask [duplicate]
Tensorflow tf.matmul example is incorrect?
drawing flower with python turtle
Python while loop iteration does not work
Rows not displaying properly in Tkinter GUI
Getting InvalidArgumentError in Tensorflow

Categories

HOME
xbox-live
cluster-computing
ipython
webrtc
wxwidgets
thunderbird-addon
formal-verification
google-sheets-api
snap.svg
filter
eval
getorgchart
cortex-a
automata
jacoco
bar-chart
flexlm
boolean-expression
clover
ip-camera
raphael
nmake
jquery-form-validator
hhvm
avplayeritem
bootstrap-typeahead
fabric
samsung-mobile
hana-studio
picturebox
helix-3d-toolkit
construct-2
modelandview
data-extraction
fabric-digits
modulo
gettext
geomesa
office365connectors
agent
rule
stdclass
archer
chown
rollback
workflow-foundation-4.5
greenhills
struts-layout
jexl
vst
freetype2
multiple-file-upload
brython
mercurial-hook
cmocka
servlet-3.0
difference
anythingslider
jta
windows-vista
code-readability
achartengine
bridge
callstack
guzzle6
lexicographic
viewflipper
proximity
left-recursion
sid
method-overriding
delphi-xe3
map-force
jstack
unison
roxygen
bigint
va-list
dynamic-binding
manage.py
trailing-slash
autostart
fotoware
mpmovieplayer
redirectstandardoutput
shim
rose-db-object
applicationcontext
mysql-error-1045
webkit-transform
ou
goliath
smooth
surefire
callgrind
ajaxpro

Resources

Mobile Apps Dev
Database Users
javascript
java
csharp
php
android
MS Developer
developer works
python
ios
c
html
jquery
RDBMS discuss
Cloud Virtualization
Database Dev&Adm
javascript
java
csharp
php
python
android
jquery
ruby
ios
html