python


Why does file_object.tell() give the same byte position at different points in a file?


I'm just starting out with Python and I can't get my head around the basic file navigation methods.
When I read the tell() tutorial, it states that it returns my current position in the file (in bytes).
My reasoning is that each character of the file adds to the byte offset, right? That would mean that after each new line, which is just a run of characters ending in \n, my byte offset should change ... but that does not seem to be what happens.
I generate a quick toy text file in bash:
$ for i in {1..10}; do echo "# this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt
Now I iterate through this file and, on each cycle, print the line and the result of the tell() call. The # characters mark the lines that delimit blocks of the file, which I want to return (see below).
My guess is that the for loop iterates over the whole file object first, reaching its end, and that is why the position always stays the same.
This is a toy example. In my real problem the file is gigabytes in size, and with the same method the result of tell() changes in blocks, which I imagine reflects how the for loop iterated over the file object.
Is this correct? Could you please shed some light on the concepts I am missing?
My final goal is to locate specific offsets in a file and then process these huge files in parallel from several starting points, which I cannot track with the approach I am currently using to scan for them.
import os

os.path.getsize("toy.txt")  # 451
fa = open("toy.txt")
fa.seek(0)  # let's double check
fa.tell()
count = 0
for line in fa:
    if line.startswith("#"):
        print line,
        print "tell {} count {}".format(fa.tell(), count)
    else:
        if count < 32775:
            print line,
            print "tell {} count {}".format(fa.tell(), count)
    count += 1
Output:
# this is the 1th line
tell 451 count 0
# this is the 2th line
tell 451 count 1
# this is the 3th line
tell 451 count 2
# this is the 4th line
tell 451 count 3
# this is the 5th line
tell 451 count 4
# this is the 6th line
tell 451 count 5
# this is the 7th line
tell 451 count 6
# this is the 8th line
tell 451 count 7
# this is the 9th line
tell 451 count 8
# this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19
You are using a for loop to read the file line by line:
for line in fa:
Files don't natively work in lines; you read blobs of data, usually in fixed-size chunks. For Python to give you lines instead, it has to read until the next newline. However, reading byte by byte to find newlines is not very efficient.
So a buffer is used; Python reads a large chunk, then finds the newlines in that chunk and yields a line for each one it finds. Once the buffer is exhausted, it reads a new chunk.
Your file is not big enough to need more than one chunk; it is only 451 bytes, while a buffer is usually measured in kilobytes. If you were to create a larger file, you'd see the file position jump in large steps as you iterate.
See the file.next documentation (next is the method that produces the next line when iterating, which is what the for loop calls):
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.
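To make the read-ahead behaviour concrete, here is a minimal sketch of a chunked line iterator. It is not CPython's actual implementation; iter_lines, positions_seen and the chunk sizes are hypothetical names and values chosen for illustration, and an in-memory io.BytesIO stands in for a file on disk. It is written to run on Python 3.

```python
import io


def iter_lines(raw, chunk_size=8192):
    """Yield lines from a binary file object, reading chunk_size bytes at a time."""
    buffer = b""
    while True:
        chunk = raw.read(chunk_size)  # this read is what moves raw.tell()
        if not chunk:
            break
        buffer += chunk
        # Hand out every complete line currently sitting in the buffer.
        while True:
            newline = buffer.find(b"\n")
            if newline == -1:
                break
            line, buffer = buffer[:newline + 1], buffer[newline + 1:]
            yield line
    if buffer:
        # Final line without a trailing newline.
        yield buffer


def positions_seen(data, chunk_size):
    """Record the underlying position each time a line is handed out."""
    raw = io.BytesIO(data)
    return [raw.tell() for _ in iter_lines(raw, chunk_size)]


# 20 lines of 9 bytes each (180 bytes), standing in for a small file.
data = b"".join(("line %03d\n" % i).encode("ascii") for i in range(20))

print(positions_seen(data, 8192))  # one chunk swallows everything
print(positions_seen(data, 64))    # small chunks: the position moves in steps
```

With the large chunk size every line reports the end-of-file position, which is the pattern in the question's output; with the small chunk size the position advances in steps of one chunk rather than one line.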
If you need to keep track of the absolute file position while looping over the lines, you'll have to use binary mode if on Windows (to prevent newline translation taking place), and keep track of the line lengths yourself:
position = 0
for line in fa:
    position += len(line)
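As a fuller sketch of that bookkeeping, the following records where each # line starts. The helper name marker_offsets and the temporary file are assumptions for illustration; the file contents mimic the toy file from the question, and the code is written to run on Python 3.

```python
import os
import tempfile


def marker_offsets(path, marker=b"#"):
    """Return the starting byte offset of each line that begins with `marker`."""
    offsets = []
    position = 0  # byte offset of the line about to be read
    with open(path, "rb") as handle:  # binary mode, so len(line) counts real bytes
        for line in handle:
            if line.startswith(marker):
                offsets.append(position)
            position += len(line)
    return offsets


# Recreate a toy file like the one in the question at a temporary path.
fd, toy_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    for i in range(1, 11):
        out.write(("# this is the %dth line\n" % i).encode("ascii"))
    for i in range(11, 21):
        out.write((" this is the %dth line\n" % i).encode("ascii"))

offsets = marker_offsets(toy_path)
total_size = os.path.getsize(toy_path)
os.remove(toy_path)

print(offsets)
print(total_size)
```

Because the position is updated after the check, each recorded offset points at the first byte of the # line, which is the value you would later pass to seek().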
The alternative is to use the io library; this is the framework used in Python 3 to handle files. Its tell() method takes the buffer into account and will produce an accurate file position even when iterating.
Bear in mind that when you use io.open() to open a file in text mode, you get unicode strings. In Python 2, you can just use binary mode (open with 'rb') if you must have str bytestrings. In fact, only in binary mode can you call tell() while iterating; in text mode an exception is raised:
>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'# this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
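If you do need text mode, one possible workaround is to call readline() yourself instead of iterating, since it is the next() call that disables tell(). The sketch below shows both behaviours on a small temporary file; the helper names and file contents are made up for illustration, and it assumes an ASCII file on Python 3 (where IOError is an alias of OSError).

```python
import io
import os
import tempfile


def tell_after_next(path):
    """Return the message raised by tell() after next(), or None if it worked."""
    with io.open(path, encoding="ascii") as handle:
        next(handle)
        try:
            handle.tell()
        except IOError as exc:
            return str(exc)
    return None


def tell_after_readline(path):
    """Return the position reported by tell() after each readline() call."""
    positions = []
    with io.open(path, encoding="ascii") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            positions.append(handle.tell())
    return positions


# Small three-line file at a temporary path (8 + 7 + 6 bytes).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(b"# first\nsecond\nthird\n")

error_message = tell_after_next(sample_path)
positions = tell_after_readline(sample_path)
os.remove(sample_path)

print(error_message)
print(positions)
```

Note that in text mode the value returned by tell() is only guaranteed to be something you can hand back to seek(); it happens to match the byte offset here because the file is plain ASCII.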
In binary mode, you get accurate output for file.tell():
>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> count = 0
>>> for line in fa:
...     if line.startswith("#"):
...         print line,
...         print "tell {} count {}".format(fa.tell(), count)
...     else:
...         if count < 32775:
...             print line,
...             print "tell {} count {}".format(fa.tell(), count)
...     count += 1
...
# this is the 1th line
tell 23 count 0
# this is the 2th line
tell 46 count 1
# this is the 3th line
tell 69 count 2
# this is the 4th line
tell 92 count 3
# this is the 5th line
tell 115 count 4
# this is the 6th line
tell 138 count 5
# this is the 7th line
tell 161 count 6
# this is the 8th line
tell 184 count 7
# this is the 9th line
tell 207 count 8
# this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19
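Since the stated goal is to start processing from specific points in the file, here is a rough sketch of how accurate offsets could be used: one pass records where each # block starts, and a separate function seeks straight to a block. index_blocks, read_block and the sample contents are hypothetical, and the code is written to run on Python 3.

```python
import os
import tempfile


def index_blocks(path, marker=b"#"):
    """Return the byte offset of every line that starts with `marker`."""
    offsets = []
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()  # position before the line is read
            line = handle.readline()
            if not line:
                break
            if line.startswith(marker):
                offsets.append(start)
    return offsets


def read_block(path, start, end=None):
    """Read the bytes from offset `start` up to `end` (or to end of file)."""
    with open(path, "rb") as handle:
        handle.seek(start)
        if end is None:
            return handle.read()
        return handle.read(end - start)


# Two small blocks, each introduced by a '#' line, at a temporary path.
fd, blocks_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(b"# block A\na1\na2\n# block B\nb1\n")

block_starts = index_blocks(blocks_path)
first_block = read_block(blocks_path, block_starts[0], block_starts[1])
second_block = read_block(blocks_path, block_starts[1])
os.remove(blocks_path)

print(block_starts)
print(first_block)
print(second_block)
```

Each (start, end) pair is independent of the others, so in principle separate workers could each open the file and handle one block; how you distribute them is up to you.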
When you iterate over the file, it uses an internal buffer to minimize expensive IO operations, so the file isn't necessarily positioned at the last character the loop saw.
