

When does Python decode a byte string while reading a file?


I have a text file with keywords and I use
with open('filename.txt','r') as file:
    list_of_words = [x.strip('\n') for x in file.readlines()]
I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128)
on line 2.
I understand the error, but I don't understand why it is raised on line 2.
According to the Python docs (https://docs.python.org/3/library/functions.html#open):
In text mode (the default, or when 't' is included in the mode
argument), the contents of the file are returned as str, the bytes
having been first decoded using a platform-dependent encoding or using
the specified encoding if given.
I read this as saying that the decoding happens when the file is opened, and that the decoded contents are returned in the 'file' variable.
Why do I get the error on line 2, then?

You seem to be confusing the file object returned by the open() call with the actual process of reading from that file object.
Python decodes the contents of the file as you read it. Opening a file doesn't read any data from it; it just creates a file object. At that point there are no bytes for Python to process yet.
In line 2 you actually read from the file, using the file.readlines() method. It is that method that tells the file object to fetch data (bytes) from the filesystem and decode those bytes. Only then can Python know that the data cannot be decoded as ASCII.
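To make the timing visible, here is a minimal sketch. It assumes a small throwaway file (the sample bytes and the temporary path are made up for illustration) and forces encoding="ascii" so the behaviour does not depend on the local platform default.

```python
import os
import tempfile

# Hypothetical sample data: 0xc4 is not a valid ASCII byte.
raw = b"first\nsecond \xc4\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

events = []
try:
    # Opening succeeds: no bytes have been read or decoded yet.
    with open(path, "r", encoding="ascii") as f:
        events.append("opened")
        try:
            f.readlines()  # bytes are fetched and decoded here
            events.append("read")
        except UnicodeDecodeError as exc:
            events.append(f"decode failed: {exc.reason}")
finally:
    os.remove(path)

print(events)
```

The open() call completes without complaint, and the UnicodeDecodeError only appears once readlines() asks for data.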

Opening the file doesn't actually examine the contents. The contents can only be decoded once some of the file data is returned by one of the file object's methods that perform a read.

You don't get an error immediately because line 1 only opens the file; it does not read from it. Opening a file just involves acquiring a handle to the file from the operating system. No contents are read.
Only when you call readlines, call read, iterate over the file, or otherwise read from it do you actually get contents, initially as bytes. These bytes are then decoded, and only then are they found not to be valid in the specified encoding.
If you don't specify an encoding, Python takes the default from the operating system configuration. From the documentation:
encoding is the name of the encoding used to decode or encode the (...). The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
It seems that on your system locale.getpreferredencoding() returns ASCII, and the file is not encoded in ASCII.
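As a quick check, the following sketch prints the default named in the quoted documentation and the encoding a file object reports when none is passed. The scratch file is hypothetical and exists only so there is something to open; the printed values depend on the machine and are not shown here.

```python
import locale
import os
import tempfile

# The default the quoted documentation refers to.
default_encoding = locale.getpreferredencoding(False)

# Hypothetical scratch file, used only to inspect the file object.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

try:
    with open(path, "r") as f:  # no encoding argument
        file_encoding = f.encoding
finally:
    os.remove(path)

print("locale default:", default_encoding)
print("file object uses:", file_encoding)
```

If both lines report an ASCII variant (for example ANSI_X3.4-1968), that would be consistent with the error in the question.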
Simply specify the correct encoding:
with open('filename.txt', 'r', encoding='utf-8') as file:
    list_of_words = [line.strip('\n') for line in file]
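A self-contained sketch of the same fix follows, assuming the keyword file really is UTF-8. The helper name load_words and the sample words are made up for illustration; the first sample word is chosen because its UTF-8 encoding contains the byte 0xc4 from the error message.

```python
import os
import tempfile


def load_words(path, encoding="utf-8"):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip("\n") for line in f]


# Hypothetical sample data: "č" is encoded in UTF-8 as the bytes 0xc4 0x8d.
raw = "čaj\nmlieko\n".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    words = load_words(path)
finally:
    os.remove(path)

# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(words))
```

If the file turns out to use a different encoding, such as latin-1 or cp1250, pass that name as the encoding argument instead.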

Actually, this is just a poorly worded explanation in the documentation. The open function does not read any content; it only returns a file handle, obtained from the OS, with the mode and encoding that were passed to it.

