When does Python decode a byte string while reading a file?


I have a text file with keywords and I use
with open('filename.txt','r') as file:
    list_of_words = [x.strip('\n') for x in file.readlines()]
I get a:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128)
on line 2
I understand the error. I don't understand why it is on line 2.
According to python docs: https://docs.python.org/3/library/functions.html#open
In text mode (the default, or when 't' is included in the mode
argument), the contents of the file are returned as str, the bytes
having been first decoded using a platform-dependent encoding or using
the specified encoding if given.
As I read this, the decoding happens when the file is opened, and the decoded contents are returned in the 'file' variable.
Why do I get the error on line 2 then?
You seem to be confusing the file object returned by the open() call with the actual process of reading from a file object.
Python decodes the contents of the file as you read it. Opening a file doesn't read any data from it; it just creates a file object. At that point there are no bytes for Python to process yet.
In line 2 you actually read from the file, using the file.readlines() method. It is that method that tells the file object to fetch data from the filesystem (bytes) and decode those bytes. Only then can Python know that the data cannot actually be decoded as ASCII.
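A minimal sketch of that timing, assuming a throwaway temporary file and a hypothetical helper named open_then_read (neither is from the question). It writes a few bytes including 0xc4, opens the file as ASCII, and records whether open() succeeded and whether the read raised:

```python
import os
import tempfile


def open_then_read(data: bytes, encoding: str = "ascii"):
    """Hypothetical helper: return (opened_without_error, error_raised_while_reading)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write the raw bytes so the file content is exactly what we pass in.
        with os.fdopen(fd, "wb") as raw:
            raw.write(data)

        opened = False
        read_error = None
        with open(path, "r", encoding=encoding) as file:
            opened = True  # open() has returned; nothing has been decoded yet
            try:
                file.readlines()  # bytes are fetched and decoded here
            except UnicodeDecodeError as exc:
                read_error = exc
        return opened, read_error
    finally:
        os.remove(path)


opened, error = open_then_read(b"alpha\nbeta \xc4\n")
print(opened)                 # open() itself did not fail
print(type(error).__name__)   # the exception came from readlines()
```

The same should hold for read(), readline() or iterating over the file, since all of them pull bytes through the decoder.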
Opening the file doesn't actually examine the contents. Only when some of the file data is returned via one of its several methods that perform a read can the contents be decoded.
You don't get an error immediately because the file is just opened, not read from in line 1. Opening a file just involves acquiring a handle from the operating system to the file - no contents are being read.
Only when you call readlines, read, iterate over the file, or otherwise read from the file do you actually get contents, initially as bytes. These bytes are then decoded, and only then found not to be valid in the specified encoding.
If you don't specify an encoding, Python takes it from the operating system configuration:
encoding is the name of the encoding used to decode or encode the (...). The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)
It seems that on your system locale.getpreferredencoding() returns ASCII, and the file is not encoded in ASCII.
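To check what your own system would pick, something like the following sketch may help. The helper name default_text_encoding and the temporary file are illustrative assumptions; the two values it returns can be spelled differently (or differ outright on newer Python versions or with UTF-8 mode enabled), so treat the output as a hint rather than a guarantee:

```python
import locale
import os
import tempfile


def default_text_encoding():
    """Hypothetical helper: return (locale_preference, encoding_chosen_by_open)."""
    preferred = locale.getpreferredencoding(False)

    # Open an empty scratch file without an encoding argument and ask the
    # resulting file object which encoding it would use for decoding.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "r") as file:
            chosen = file.encoding
    finally:
        os.remove(path)
    return preferred, chosen


preferred, chosen = default_text_encoding()
print(preferred, chosen)
```

If this reports an ASCII variant (for example ANSI_X3.4-1968 under a C/POSIX locale), any byte above 0x7f in the file will fail to decode.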
Simply specify the correct encoding:
with open('filename.txt', 'r', encoding='utf-8') as file:
    list_of_words = [line.strip('\n') for line in file]
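As a self-contained check of that fix, here is a sketch that assumes the keyword file really is UTF-8 (the question does not say which encoding it uses). The read_keywords helper and the sample words are made up for illustration; the second word contains U+0101, whose UTF-8 encoding starts with the byte 0xc4 from the error message:

```python
import os
import tempfile


def read_keywords(path, encoding="utf-8"):
    """Hypothetical helper: read one keyword per line using an explicit encoding."""
    with open(path, "r", encoding=encoding) as file:
        return [line.strip("\n") for line in file]


fd, path = tempfile.mkstemp(suffix=".txt")
try:
    # Sample data (an assumption): "\u0101" is encoded in UTF-8 as bytes c4 81.
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as out:
        out.write("alpha\nz\u0101le\n")
    words = read_keywords(path)
finally:
    os.remove(path)

print(words)
```

If the file turns out to be in a different encoding (Latin-1, for instance, where 0xc4 alone is a complete character), pass that name as the encoding argument instead.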
Actually, this is just a poorly worded explanation in the documentation. The open function does not read any content; it only obtains a file handle from the OS and returns a file object configured with the mode and encoding that were passed to it.
