python


When does Python decode a byte string while reading a file?


I have a text file with keywords, and I read it with:
with open('filename.txt','r') as file:
    list_of_words = [x.strip('\n') for x in file.readlines()]
I get this error:
UnicodeDecodeError : 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128)
on line 2.
I understand the error, but I don't understand why it is raised on line 2.
According to the Python docs: https://docs.python.org/3/library/functions.html#open
In text mode (the default, or when 't' is included in the mode
argument), the contents of the file are returned as str, the bytes
having been first decoded using a platform-dependent encoding or using
the specified encoding if given.
I read this as saying that decoding happens while the file is being opened, and that the decoded result is what ends up in the 'file' variable.
Why do I get the error on line 2, then?
You seem to be confusing the file object returned by the open() call with the actual process of reading from that file object.
Python decodes the contents of the file as you read it. Opening a file doesn't read any data; it just creates a file object, so at that point there are no bytes for Python to process yet.
In line 2 you actually read from the file, using the file.readlines() method. It is that method that tells the file object to fetch data (bytes) from the filesystem and decode those bytes. Only then can Python know that the data cannot be decoded as ASCII.
Opening the file doesn't examine the contents. The contents can only be decoded once one of the file object's reading methods actually returns some of the data.
You don't get an error immediately because in line 1 the file is only opened, not read from. Opening a file just involves acquiring a handle to the file from the operating system; no contents are read.
Only when you call readlines or read, iterate over the file, or otherwise read from it do you actually get contents, initially as bytes. These bytes are then decoded, and only then are they found not to be valid in the specified encoding.
If you don't specify an encoding, Python takes a default from the operating system configuration:
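To see this ordering for yourself, here is a minimal sketch. It assumes a throwaway temporary file with made-up contents, and it passes encoding='ascii' explicitly so the behaviour does not depend on your locale. The open() call succeeds, and the UnicodeDecodeError only appears once readlines() is called:

```python
import os
import tempfile

# Write a small file whose second line contains a non-ASCII byte (0xc4).
# The file and its contents are invented for this demonstration.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'plain\nnot \xc4 ascii\n')
    path = tmp.name

events = []
try:
    # Opening succeeds: no bytes have been read or decoded yet.
    with open(path, 'r', encoding='ascii') as file:
        events.append('opened')
        try:
            file.readlines()  # the read is where decoding happens
            events.append('read')
        except UnicodeDecodeError as exc:
            events.append('decode error: ' + exc.reason)
finally:
    os.remove(path)

print(events)
```

The 'opened' entry is recorded before any decoding is attempted, which matches the traceback pointing at the line that reads rather than the line that opens.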
encoding is the name of the encoding used to decode or encode the (...). The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
It seems that on your system locale.getpreferredencoding() returns ASCII, and the file is not encoded in ASCII.
Simply specify the correct encoding:
with open('filename.txt', 'r', encoding='utf-8') as file:
    list_of_words = [line.strip('\n') for line in file]
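If you want to check what your system reports and confirm that an explicit encoding avoids the problem, something like the following sketch may help. It assumes the file really is UTF-8; the temporary file and the sample keywords are made up for illustration:

```python
import locale
import os
import tempfile

# The default that the quoted documentation says open() uses
# when no encoding argument is given.
print(locale.getpreferredencoding(False))

# Sample keywords; '\u010d' (c with caron) is the bytes 0xc4 0x8d in UTF-8.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('plain\n\u010daj\n'.encode('utf-8'))
    path = tmp.name

try:
    # An explicit encoding makes the result independent of the locale.
    with open(path, 'r', encoding='utf-8') as file:
        list_of_words = [line.strip('\n') for line in file]
finally:
    os.remove(path)

print(list_of_words)
```

If the file is in some other encoding (for example latin-1), pass that name instead; guessing wrong either raises the same kind of error or silently produces the wrong characters.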
Actually, this is just a poorly worded explanation in the documentation. The open() function does not read any content; it only returns a file object, backed by a handle obtained from the OS, that is set up with the mode and encoding passed to it.
