When does Python decode a byte string while reading a file?
I have a text file with keywords and I use

    with open('filename.txt','r') as file:
        list_of_words = [x.strip('\n') for x in file.readlines()]

I get a:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128)

on line 2. I understand the error. I don't understand why it is raised on line 2.

According to the Python docs (https://docs.python.org/3/library/functions.html#open): "In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given."

I read this as saying that the decoding happens while opening the file, and that the decoded contents are returned in the 'file' variable. Why do I get the error on line 2 then?
You seem to be confusing the file object returned by the open() call with the actual process of reading from that file object. Python decodes the contents of the file as you read it. Opening a file doesn't read any data from it; it just creates a file object, so at that point there are no bytes for Python to process yet. In line 2 you actually read from the file, using the file.readlines() method. It is that method that tells the file object to fetch data (bytes) from the filesystem and decode those bytes. Only then can Python know that the data cannot be decoded as ASCII.
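To see the timing for yourself, here is a minimal sketch. The temporary file, its contents, and the events list are made up for illustration, and encoding='ascii' is passed explicitly only to reproduce the asker's situation on any platform:

```python
import os
import tempfile

# Hypothetical sample file containing a non-ASCII character:
# the bytes 0xC4 0x85 are the UTF-8 encoding of U+0105.
path = os.path.join(tempfile.mkdtemp(), "keywords.txt")
with open(path, "wb") as f:
    f.write(b"alpha\n\xc4\x85\n")

events = []

# Force ASCII so the behaviour does not depend on the local default encoding.
with open(path, "r", encoding="ascii") as file:
    events.append("opened")      # open() succeeded: nothing has been decoded yet
    try:
        file.readlines()         # reading is what triggers the decoding
    except UnicodeDecodeError as exc:
        events.append("read failed")
        print(exc)

print(events)
```

The open() call goes through without complaint; the UnicodeDecodeError only appears once readlines() pulls bytes from the file and tries to decode them.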
Opening the file doesn't actually examine the contents. The contents can only be decoded once some of the file data is returned by one of the file object's methods that perform a read.
You don't get an error immediately because in line 1 the file is just opened, not read from. Opening a file only involves acquiring a handle to the file from the operating system; no contents are read. Only when you call readlines, call read, iterate over the file, or otherwise read from it do you actually get contents, initially as bytes. These bytes are then decoded, and only then are they found not to be valid in the specified encoding.

If you don't specify an encoding, Python guesses it from the operating system configuration. From the documentation: "encoding is the name of the encoding used to decode or encode the (...). The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)"

It seems that on your system locale.getpreferredencoding() returns ASCII, and the file is not encoded in ASCII. Simply specify the correct encoding:

    with open('filename.txt', 'r', encoding='utf-8') as file:
        list_of_words = [line.strip('\n') for line in file]
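If you want to check what your platform would pick by default, something along these lines should work. It is only a sketch: the temporary file and its contents are invented for the example, and it assumes the real file is UTF-8, which you would have to confirm for your own data:

```python
import locale
import os
import tempfile

# Hypothetical sample file: the bytes 0xC4 0x85 are U+0105 encoded as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "keywords.txt")
with open(path, "wb") as f:
    f.write(b"alpha\n\xc4\x85\n")

# The locale's preferred encoding, i.e. the platform-dependent default.
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly makes the result independent of that default.
with open(path, "r", encoding="utf-8") as file:
    print(file.encoding)  # the encoding this file object will decode with
    list_of_words = [line.strip("\n") for line in file]

print(list_of_words)
```

The file object's encoding attribute is set as soon as the file is opened, so you can inspect it before any bytes are read or decoded.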
Actually, this is just a poorly worded explanation in the documentation. The open function does not read any content; it only returns a file handle, obtained from the OS, set up with the mode and encoding that were passed to it.