When does Python decode a byte string while reading a file?


I have a text file with keywords, and I read it with:
    with open('filename.txt', 'r') as file:
        list_of_words = [x.strip('\n') for x in file.readlines()]
I get:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128)
on line 2.
I understand the error. I don't understand why it is raised on line 2.
According to the Python docs: https://docs.python.org/3/library/functions.html#open
In text mode (the default, or when 't' is included in the mode
argument), the contents of the file are returned as str, the bytes
having been first decoded using a platform-dependent encoding or using
the specified encoding if given.
I read this to mean that decoding happens when the file is opened, and that the decoded contents are what the 'file' variable holds.
Why do I get the error on line 2 then?
You seem to be confusing the file object returned by the open() call with the actual process of reading from that file object.
Python decodes the contents of the file as you read it. Opening a file doesn't read any data from it; it just creates a file object. At that point there are no bytes for Python to process yet.
In line 2 you actually read from the file, using the file.readlines() method. It is that method that tells the file object to fetch data (bytes) from the filesystem and decode those bytes. Only then can Python know that the data cannot be decoded as ASCII.
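A minimal sketch of that behaviour, assuming a throwaway file created just for the demonstration (the name keywords.txt and the events list are made up for illustration). The file contains the byte 0xc4 from the question, and the ASCII encoding is passed explicitly so the result does not depend on the platform default:

```python
import os
import tempfile

# Create a small file containing a non-ASCII byte (0xc4), as in the question.
path = os.path.join(tempfile.mkdtemp(), "keywords.txt")
with open(path, "wb") as raw:
    raw.write(b"alpha\n\xc4beta\n")

events = []

# Line 1 of the question: opening succeeds, nothing has been decoded yet.
with open(path, "r", encoding="ascii") as file:
    events.append("opened")
    try:
        # Line 2 of the question: reading fetches bytes and decodes them.
        file.readlines()
        events.append("read")
    except UnicodeDecodeError:
        events.append("decode failed during read")

print(events)
```

The open() call completes without complaint; the exception only appears once readlines() asks for data.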
Opening the file doesn't examine its contents. The contents can only be decoded when some of the file data is returned by one of the methods that perform a read.
You don't get an error immediately because line 1 only opens the file; it does not read from it. Opening a file just involves acquiring a handle to the file from the operating system. No contents are read.
Only when you call readlines or read, iterate over the file, or otherwise read from it do you actually get contents, initially as bytes. These bytes are then decoded, and only then found not to be valid in the specified encoding.
If you don't specify an encoding, Python takes the default from the operating system configuration:
encoding is the name of the encoding used to decode or encode the (...). The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
It seems that on your system locale.getpreferredencoding() returns ASCII, and the file is not encoded in ASCII.
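To check what default your own system uses, something like the following should work. It is only a sketch: the temporary file and the variable names are made up for illustration, and the values printed depend entirely on your platform and locale settings.

```python
import locale
import os
import tempfile

# The platform-dependent default mentioned in the documentation quote.
default_encoding = locale.getpreferredencoding(False)
print("preferred encoding:", default_encoding)

# A file object opened without an explicit encoding reports the
# encoding it will use for decoding in its .encoding attribute.
path = os.path.join(tempfile.mkdtemp(), "probe.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("probe\n")

with open(path, "r") as file:
    file_encoding = file.encoding
print("file object encoding:", file_encoding)
```

If either value turns out to be an ASCII variant (for example ANSI_X3.4-1968), that would explain the 'ascii' codec in the traceback.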
Simply specify the correct encoding:
    with open('filename.txt', 'r', encoding='utf-8') as file:
        list_of_words = [line.strip('\n') for line in file]
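As a quick check of this fix, here is a self-contained sketch that assumes the file really is UTF-8 (the sample words are invented). The character U+010D is encoded in UTF-8 as the bytes 0xc4 0x8d, so the file contains the same 0xc4 byte that the ASCII codec rejected:

```python
import os
import tempfile

# Write sample keywords as UTF-8; "\u010d" (c with caron) encodes to 0xc4 0x8d.
path = os.path.join(tempfile.mkdtemp(), "filename.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("plain\n\u010daj\n")

# Confirm the raw bytes contain 0xc4, the byte from the error message.
with open(path, "rb") as raw:
    has_c4 = b"\xc4" in raw.read()

# With the matching encoding, reading and decoding succeed.
with open(path, "r", encoding="utf-8") as file:
    list_of_words = [line.strip("\n") for line in file]

print(has_c4, list_of_words)
```

If the real file is in some other encoding, utf-8 may raise its own UnicodeDecodeError or produce wrong characters, so the encoding argument has to match how the file was written.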
Actually, this is just a poorly worded explanation in the documentation. The open() function does not read any content. It only obtains a file handle from the OS and returns a file object that carries the mode and encoding you passed in.
