File Handling
=============

*(C) Copyright Notice: This chapter is part of the book available at*\ https://pp4e-book.github.io/\ *and copying, distributing, modifying it requires explicit permission from the authors. See the book page for details:*\ https://pp4e-book.github.io/

All variables used in a program are kept in the main memory and they are **volatile**, i.e., their values are lost when the program ends. Even if you write a program running forever, your data will be lost in case of a shutdown or a power failure.

Another drawback of the main memory is the capacity limitation. In the extreme case, when you need more than a couple of gigabytes for your variables, it will be difficult to keep all of them in the main memory. In particular, infrequently required variables are better kept on an external storage device instead of the main memory.

**Files** provide a mechanism for storing data **persistently** on hard drives, which provide significantly larger storage than the main memory. These devices are also called *secondary storage* devices. The data you put in a file will stay on the hard drive until someone overwrites or deletes the file (or when the hard drive fails, which is a sad but rare case).

A **File** is a sequence of bytes stored on the secondary storage, typically a hard drive (alternative secondary storage devices include CD, DVD, USB disk, tape drive). Data in a file has the following differences from data in memory (variables):

1. A file is just a sequence of bytes. Therefore, data in a file is unorganized: there is no data type and there are no variable boundaries.
2. Data needs to be accessed indirectly, using I/O functions. E.g., updating a value in a file requires reading it into memory, updating it in memory, then writing it back into the file.
3. Accessing and updating data is significantly slower since it is on an external device.

Keeping data in a file instead of the main memory has the following use cases:

1. Data needs to be persistent. Data will be in the file when you restart your program, reboot your machine or when you find your ancient laptop in the basement 30 years later (Probably it will not be there when a 3000BC archeologist finds your laptop on an excavation site. Hard disks are not that durable. So, persistency is bounded).
2. You need to exchange data with another program. Examples:

   - You download data from the web and your program gets it as input.
   - You like to generate data in your program and put it on a spreadsheet for further processing.

3. You have a large amount of data which does not fit in the main memory. In this case, you will probably use a library or software like a database management system to access data in a faster and organized way. Files are the most primitive, basic way of achieving it.

In this chapter, we will talk about simple file access so that you will learn about simple file operations like open, close, read, write. The examples of the chapter will create and modify files when run – we strongly encourage you to check the contents of the created files using the file access mechanism at the left-hand side.

First Example
-------------

Let us quickly look at a simple example to get a feeling for the different steps involved in working with files.

.. code:: python

   fpointer = open('firstexample.txt',"w")
   fpointer.write("hello\n")
   fpointer.write("how are\n")
   fpointer.write("you?\n")
   fpointer.close()

The program above will create a file in the current directory with filename ``firstexample.txt``. You can open it with your favourite text editor (there are plenty of text editors for operating systems: notepad, wordpad, textedit, nano, vim) to see and edit it. The content will look like this:

::

   hello
   how are
   you?

The first line of the program is ``fpointer = open('firstexample.txt',"w")``. This line opens the file named ``firstexample.txt`` for writing to it. If the file exists, its content will be erased (it will be an empty file afterwards).
The result of ``open()`` is a file object that we will use in the following lines. This object is assigned to the variable ``fpointer``. In the following lines, all functions we call with this file object ``fpointer`` will work on this file (i.e. ``firstexample.txt``). This special *dot* notation helps us with calling functions in the scope of the file. ``fpointer.``\ *functionname*\ ``()`` will call the *functionname* function for this file.

The ``write(string)`` function writes the ``string`` content to the file. Each call to ``write(string)`` will append the ``string`` to the file and the file will grow.

At the end, when we are done, we call ``close()`` to finish accessing the file so that your operating system will know and take the necessary actions. All open files will be closed when your program terminates. However, calling ``close()`` after finishing writing is a good programming practice.

Now, let us read this file:

.. code:: python

   fp = open("firstexample.txt","r")
   content = fp.read()
   fp.close()
   print(content)

.. parsed-literal::
   :class: output

   hello
   how are
   you?

In this case, we called ``open()`` with argument ``"r"`` which tells that we are going to read the file (or use it as an input source). If you skip the second argument in ``open()``, it is assumed to be ``"r"``, so ``open("firstexample.txt")`` will be equivalent.

The ``read()`` call gets an optional argument, which is the number of characters to read (for plain ASCII text such as our example, one character is one byte). If you skip it, it will read the whole file content and return it as a string. Therefore, after the call, the ``content`` variable will be a string with the file content.

Files and Sequential Access
---------------------------

A file consists of bytes and ``read/write`` operations access those bytes **sequentially**. In sequential access, the current I/O operation updates the file state so that the next I/O operation will resume from the end of the current I/O operation.
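
To make this file state visible, here is a minimal sketch that prints the current offset with ``tell()`` (one of the file functions listed at the end of this chapter) before each read. The file name ``offsetdemo.txt`` and its content are made up for the illustration and are not part of the chapter's examples:

.. code:: python

   fp = open("offsetdemo.txt", "w")    # hypothetical demo file
   fp.write("abcdefghij")              # 10 plain ASCII characters, no newline
   fp.close()

   fp = open("offsetdemo.txt", "r")
   positions = []
   for i in range(3):
       positions.append(fp.tell())     # offset before this read
       print(fp.tell(), fp.read(4))    # each read resumes where the previous one ended
   fp.close()

For this small ASCII file the recorded offsets grow by the number of characters read in each step, which is the state that every ``read()`` call silently updates.
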
Assume you have an old MP3 player that supports only a *play me next 10 seconds* operation on a button. Pressing it will play the next 10 seconds of the song. When you press again, it will resume from where it left off and play another 10 seconds. This continues until the song is over. Sequential access is like this. A *file pointer* keeps the current offset of the file and each I/O operation advances it so that the next call will read or write from this new offset – see :numref:`ch8_file_handling`.

.. _ch8_file_handling:

.. figure:: ../figures/ch8_sequential_read.png
   :width: 500px

   Sequential reading of a file.

The following is a sample program illustrating sequential access:

.. code:: python

   fp = open("firstexample.txt","r")   # the example file we created above
   for i in range(3):                  # repeat 3 times
       content = fp.read(4)            # read 4 bytes in each step
       print("> ", content)            # output 4 bytes preceded by >
   fp.close()

.. parsed-literal::
   :class: output

   >  hell
   >  o
   ho
   >  w ar

The text in the file was:

::

   hello
   how are
   you?

The first ``read()`` reads ``'hell'``, the second reads ``'o\nho'`` (note that ``\n`` stands for a new line so that ``ho`` is printed on a new line), and the third reads ``'w ar'``. After these operations, the file offset is left at a position so that the following reads will resume from content ``'e\nyou?\n'``.

We provided the example with 4-byte read operations. However, for text files, the typical scenario is reading characters line by line instead of fixed size strings.

Data Conversion and Parsing
---------------------------

A file, specifically a text file, consists of strings. However, especially in engineering and science, we work with numbers. A number is represented in a text file as a sequence of characters including digits, a sign prefix (``'-'`` and ``'+'``) and at most one occurrence of a dot (``'.'``).
That means in your Python program, you may use :math:`\pi` as ``3.1416``; however, in the text file, you store ``'3.1416'``, which is a string consisting of chars ``'3','.','1','4','1','6'``.

.. code:: python

   pi = 3.1416
   pistr = '3.1416'
   print(pi+pi,':', pi * 3)
   print(pistr+pistr,':', pistr *3)

.. parsed-literal::
   :class: output

   6.2832 : 9.4248
   3.14163.1416 : 3.14163.14163.1416

Funny, the second line of output above is a result of Python interpreting the ``+`` operator as string concatenation, and ``*`` as adjoining multiple copies of a string.

If we need to treat numbers as numbers, we need to convert them from string. There are two handy functions for this: ``int()`` and ``float()`` convert a string into an integer and a floating point value, respectively. Here is an illustration:

.. code:: python

   pistr = ' 0.31416E01 '
   nstr = ' 47 '

   # Convert numbers in the strings into numerical data types:
   piflt = float(pistr)
   nint = int(nstr)

   print(piflt*2, nint*2)

.. parsed-literal::
   :class: output

   6.2832 94

Note that we cannot call ``int('3.1416')`` since the string is not a valid integer. That brings us another challenge of making sure that strings we need to convert are actually numbers. Obviously ``int('hello')`` and ``float('one point five')`` will not work. The ways of dealing with such errors are left for the next chapter. In this chapter, we assume that we have our data carefully created and all conversions work without any error.

Our next challenge is having multiple numbers in a string separated by special characters or simply spaces, as in ``'10.0 5.0 5.0'``. In this case, we need to decompose a string into string pieces representing numbers, so that we will have ``'10.0','5.0','5.0'`` for the above string. The next step will be converting them into numbers:

``'10.0 5.0 5.0'`` :math:`\overset{Step\ 1}{\longrightarrow}` [``'10.0','5.0','5.0'``] :math:`\overset{Step\ 2}{\longrightarrow}` ``[10.0, 5.0, 5.0]``

For the first step, we will use the ``split()`` method of a string.
The string, or the variable containing the string, is followed by ``.split(delimiter)``, which returns a list of the substrings separated by the given delimiter. The delimiters are removed and all values in between are put in the list – for example:

.. code:: python

   print('a:b:c'.split(':'))
   print('hello darkness, my old friend'.split(' '))
   print('a <=> b <=> c'.split(' <=> '))
   print('multiple       spaces          are         tricky'.split(' '))
   a = '10.0 5.0 5.0'
   print(a.split(' '))

.. parsed-literal::
   :class: output

   ['a', 'b', 'c']
   ['hello', 'darkness,', 'my', 'old', 'friend']
   ['a', 'b', 'c']
   ['multiple', '', '', '', '', '', '', 'spaces', '', '', '', '', '', '', '', '', '', 'are', '', '', '', '', '', '', '', '', 'tricky']
   ['10.0', '5.0', '5.0']

For the second step, we will use the ``float()`` function on a list (or the ``int()`` function if you have a list of integers). We have a couple of options for this. One is to start from an empty list and append the converted value at each step:

.. code:: python

   instr = '10.0 5.0 5.0'
   outlst = []

   # Go over each substring
   for substr in instr.split(' '):
       outlst += [float(substr)]   # Convert each element to float and append it to the list

   print(outlst)

.. parsed-literal::
   :class: output

   [10.0, 5.0, 5.0]

A more practical and faster version is a list comprehension, which is a compact way of mapping each value into another:

.. code:: python

   instr = '10.0 5.0 5.0'
   outlst = [ float(substr) for substr in instr.split(' ')]
   print(outlst)

.. parsed-literal::
   :class: output

   [10.0, 5.0, 5.0]

As we have explained in Chapter 3, the syntax is similar to set/list notation in Math:

:math:`\left\{ float(s)\ \vert\ s \in S\right\}` vs. ``[float(s) for s in S]``

If the values may be separated by multiple spaces, you can use “``import re``” and call “``re.split(' +', inputstr)``” instead of “``inputstr.split(' ')``”. This will split the ``'multiple spaces are tricky'`` example above into 4 words without spaces. How it works is beyond the scope of the book.
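
A minimal sketch of the ``re.split()`` call mentioned above, next to plain ``split(' ')`` for comparison. The variable names and the sample string are ours, chosen only for illustration:

.. code:: python

   import re

   inputstr = 'multiple   spaces  are    tricky'

   # Plain split(' ') produces an empty string between consecutive spaces
   plain = inputstr.split(' ')

   # The pattern ' +' means "one or more spaces", so a run of spaces counts as one delimiter
   words = re.split(' +', inputstr)

   print(plain)
   print(words)

Note that a string with leading or trailing spaces would still produce empty strings at the ends of the list with this pattern, so such input may need an extra ``strip()`` first.
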
Curious readers can refer to the “``re``” and “``parse``” modules for more advanced forms of input parsing. These are not trivial modules for beginners.

Now, let us consider the reverse of the operation: Assume we have a list of numbers and we like to convert that into a string that can be written in a file:

``[10.0, 5.0, 5.0]`` :math:`\overset{Step\ 1}{\longrightarrow}` ``["10.0","5.0","5.0"]`` :math:`\overset{Step\ 2}{\longrightarrow}` ``"10.0 5.0 5.0"``

The first step will be handled with the ``str()`` function which converts any Python value into a human readable string:

.. code:: python

   inlst = [10.0, 5.0, 5.0]
   outlst = [str(num) for num in inlst]
   print(outlst)

.. parsed-literal::
   :class: output

   ['10.0', '5.0', '5.0']

The next step is to join those elements with a delimiter, which is the reverse of the ``split()`` operation. Not by accident, the name of this operation is ``join()``. ``join()`` is a method of the delimiter string and the list is its argument. ``':'.join(['hello','how','are','you?'])`` returns ``'hello:how:are:you?'``.

.. code:: python

   inlst = [10.0, 5.0, 5.0]
   outlst = [str(num) for num in inlst]
   print(' '.join(outlst))

.. parsed-literal::
   :class: output

   10.0 5.0 5.0

A more advanced way of converting values into strings is called *formatted output* and is briefly introduced in a section below.

Accessing Text Files Line by Line
---------------------------------

Files consisting of human readable strings are called **text files**. Text files consist of strings separated by the *end of line* character ``'\n'``, also known as *new line*. The sequence of characters in a file contains the end-of-line characters so that a text editor will end the current line and show the following characters on a new line.

We use end-of-line characters so that logically relevant data is on the same line. For example:

::

   4
   10.0 20.0
   15.5 22.2
   3 44
   10 10.5

Let us assume the integer value ``4`` on the first line denotes how many lines will follow.
Assume also that each of the following 4 lines has two real values denoting the :math:`x` and :math:`y` values of a point. In this way, we can represent our input separated by end-of-line characters for each point and by a space character for each value in a line.

Let us create such a text file from a Python list. Please note that the file ``read`` function returns a string and the ``write`` function expects a string argument. I.e., calling ``write(3.14)`` will fail. In order to make the conversion, we use the ``str()`` function for numeric values and call ``write(str(3.14))`` instead.

Another tricky point is that ``write()`` does not put an end-of-line character automatically. You need to put it in the output string or call an extra ``write("\n")``.

.. code:: python

   pointlist = [(0,0), (10,0), (10,10), (0,10)]

   fp = open("pointlist.txt", "w")    # open file for writing

   fp.write(str(len(pointlist)))      # write list length
   fp.write('\n')

   # Go over each point in the list
   for (x,y) in pointlist:            # for each x,y value in the list
       fp.write(str(x))               # write x
       fp.write(' ')                  # space as number separator
       fp.write(str(y))               # write y
       fp.write('\n')                 # \n as line separator

   fp.close()

   # let us read the content to verify what we wrote
   fp = open("pointlist.txt")         # open for reading
   content = fp.read()
   print(content)
   fp.close()

.. parsed-literal::
   :class: output

   4
   0 0
   10 0
   10 10
   0 10

Using ``read()`` will get the whole content of the file; if the file is large, your program would use too much memory and processing the data will be difficult. Instead of that, we can access a text file line by line using the ``readline()`` function.

Let us write a program to read and output the content of a text file. We need a loop to read the file line by line and output each line. When to stop, however, is another problem. Python’s ``read()`` and ``readline()`` functions return an empty string ``''`` when there is nothing left to read. We can use this to stop reading:

.. code:: python

   fp = open("pointlist.txt")     # open file for reading

   nextline = fp.readline()       # read the first line
   while nextline != '':          # while read is successful
       print(nextline)            # output the line
       nextline = fp.readline()   # read the nextline

   fp.close()                     # when nextline == '' loop terminates

.. parsed-literal::
   :class: output

   4

   0 0

   10 0

   10 10

   0 10


Please note the empty lines between the output lines. This is due to the ``'\n'`` character at the end of the string that ``readline()`` returns. In other words, ``readline()`` keeps the new line character it reads. ``print()`` puts an end of line after the output (this can be suppressed by adding an ``end=''`` argument). As a result, we have an extra end-of-line at the end of each line. In order to avoid it, you can call ``rstrip('\n')`` on the returned string to remove the end of line. The new code will be:

.. code:: python

   fp = open("pointlist.txt")             # open file for reading

   nextline = fp.readline()               # read the first line
   while nextline != '':                  # while read is successful
       nextline = nextline.rstrip('\n')   # remove occurrences of '\n' at the end
       print(nextline)                    # output the line
       nextline = fp.readline()           # read the nextline

   fp.close()

.. parsed-literal::
   :class: output

   4
   0 0
   10 0
   10 10
   0 10

Converting this file into the initial Python list ``[(0,0), (10,0), (10,10), (0,10)]`` is our next challenge. This requires conversion of a string such as ``"0 0\n"`` into ``(0,0)``. The first one is of type ``str`` whereas the second is a tuple of numeric values. We can use the ``int()`` or ``float()`` functions to convert strings into numbers. Note that the string should contain a valid representation of a Python numeric value: ``int("hello")`` will raise an error.

The second issue is separating two numbers in the same string. We can use the ``split()`` method with the separator string as its argument, as in ``nextline.split(' ')``. This call will return a list of strings from a string.
If the separator does not occur in the string, it will return a list with one element; if there is one separator, it will return two elements. For :math:`n` occurrences of the separator, it will return a list with :math:`n+1` elements.

Here is the solution in Python:

.. code:: python

   fp = open("pointlist.txt")             # open file for reading

   pointlist = []                         # start with empty list
   nextline = fp.readline()               # read the first line
   n = int(nextline)                      # find number of lines to read

   for i in range(n):                     # repeat n times
       nextline = fp.readline()           # read the nextline
       nextline = nextline.rstrip('\n')   # remove occurrences of '\n' at the end
       (x, y) = nextline.split(' ')       # get x and y (note that they are still strings)
       x = float(x)                       # convert them into real values
       y = float(y)
       pointlist.append( (x,y) )          # add tuple at the end

   fp.close()

   print(pointlist)                       # output the resulting list

.. parsed-literal::
   :class: output

   [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

Termination of Input
--------------------

There are two ways to stop reading input:

1. By reading a definite number of items.
2. By the end of the file.

In our previous examples, we read an integer that told us how many lines followed in the file. Then, we called ``readline()`` in a ``for`` loop with the given number of lines. This is an example of the first case which provides a definite number of items.

The alternative is to read lines in a ``while`` loop until a termination condition arises. The termination condition is usually the **end of file**, the case where functions like ``read()`` and ``readline()`` return an empty string ``''``.

.. code:: python

   fp = open("pointlist.txt")             # open file for reading

   pointlist = []                         # start with empty list
   nextline = fp.readline()               # skip the first line (4) since we don't need it

   nextline = fp.readline()               # read the first line
   while nextline != '':                  # until end of file
       nextline = nextline.rstrip('\n')   # remove occurrences of '\n' at the end
       (x, y) = nextline.split(' ')       # get x and y (note that they are still strings)
       x = float(x)                       # convert them into real values
       y = float(y)
       pointlist.append( (x,y) )          # add tuple at the end
       nextline = fp.readline()           # read the nextline

   fp.close()

   print(pointlist)

.. parsed-literal::
   :class: output

   [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

Note that the example above skips (reads and throws away) the first line so that the integer on the first line is ignored. When your input does not contain such an unnecessary value, you can delete this line.

Sometimes termination can be marked explicitly by a *sentinel value*, which is a value marking the end of values. This is especially useful when you have multiple objects to read:

.. code:: python

   # First, create a file named `twopointlists.txt`
   fp = open("twopointlists.txt", "w")
   fp.write("""3 0
   3.4 2.1
   5.1 3.2
   EOLIST
   1 1.5
   2.0 2.5""")
   fp.close()

This will create a sample file content as:

::

   3 0
   3.4 2.1
   5.1 3.2
   EOLIST
   1 1.5
   2.0 2.5

.. code:: python

   fp = open("twopointlists.txt")

   pntlst1 = []                           # start with empty list
   pntlst2 = []                           # start with empty list

   nextline = fp.readline()               # read the first line
   while nextline != 'EOLIST\n':          # sentinel value
       nextline = nextline.rstrip('\n')   # remove occurrences of '\n' at the end
       (x, y) = nextline.split(' ')       # get x and y (note that they are still strings)
       x = float(x)                       # convert them into real values
       y = float(y)
       pntlst1.append( (x,y) )            # add tuple at the end
       nextline = fp.readline()           # read the nextline

   # first list has been read, now continue with the second list from the same file
   nextline = fp.readline()
   while nextline != '':                  # until end of file
       nextline = nextline.rstrip('\n')   # remove occurrences of '\n' at the end
       (x, y) = nextline.split(' ')       # get x and y (note that they are still strings)
       x = float(x)                       # convert them into real values
       y = float(y)
       pntlst2.append( (x,y) )            # add tuple at the end
       nextline = fp.readline()           # read the nextline

   fp.close()

   print('List 1:', pntlst1)
   print('List 2:', pntlst2)

.. parsed-literal::
   :class: output

   List 1: [(3.0, 0.0), (3.4, 2.1), (5.1, 3.2)]
   List 2: [(1.0, 1.5), (2.0, 2.5)]

Example: Processing CSV Files
-----------------------------

**CSV** stands for *Comma Separated Values*; it is a text-based format for exporting/importing spreadsheet (e.g. Excel) data. Rows in a CSV file are separated by newlines and the columns of each row are separated by a comma ``,``. Actually, the format is more complex but for the time being, let us ignore commas that might appear inside strings and focus on a simple form as follows:

::

   Name,Surname,Grade
   Han,Solo,80
   Luke,Skywalker,90
   Obi,Van Kenobi,88
   Leya,Skywalker,91
   Anakin,Skywalker,55

Usually the first line holds the names of the columns in a spreadsheet. Now, let us create this file:

.. code:: python

   content = '''Name,Surname,Grade
   Han,Solo,80
   Luke,Skywalker,90
   Obi,Van Kenobi,88
   Leya,Skywalker,91
   Anakin,Skywalker,55
   '''

   fp = open("first.csv", "w")   # open for writing
   fp.write(content)             # write in a single operation, practical for small files
   fp.close()

Our next task is to read this file into memory as a list of dictionaries, as:

``[{"Name":"Han", "Surname":"Solo","Grade":"80"},...]``

We need to read the file line by line, extract the components using the ``split()`` method, then create the dictionary. Then, we can append it to the resulting list. For example:

.. code:: python

   fp = open("first.csv","r")        # open for reading
   line = fp.readline()              # read column names
   line = line.rstrip('\n')          # get rid of new line
   colnames = line.split(',')        # list of column names
   result = []                       # resulting list of dictionaries

   line = fp.readline()
   while line != '':                 # end-of-file check
       line = line.rstrip('\n')
       entry = {}                    # start with empty dictionary
       c = 0                         # a counter to address column number
       for v in line.split(','):     # in a loop process each column of the row
           entry[colnames[c]] = v    # column name is index, value is from current row
           c += 1
       result.append(entry)          # add dictionary to result
       line = fp.readline()          # read next line
   fp.close()
   print(type(result))
   print(result)

.. parsed-literal::
   :class: output

   <class 'list'>
   [{'Name': 'Han', 'Surname': 'Solo', 'Grade': '80'}, {'Name': 'Luke', 'Surname': 'Skywalker', 'Grade': '90'}, {'Name': 'Obi', 'Surname': 'Van Kenobi', 'Grade': '88'}, {'Name': 'Leya', 'Surname': 'Skywalker', 'Grade': '91'}, {'Name': 'Anakin', 'Surname': 'Skywalker', 'Grade': '55'}]

Let us improve this example by adding a column as a result of a computation. Let us calculate the grade average and show the difference from the average as a new column.

We need to go over all grade values in the list, convert them to real values (so that we can do arithmetic on them), calculate the average, then go over all rows to add a new column.
Then, go over the list again to export/write it into a new CSV file.

.. code:: python

   n = 0

   # Calculate the average
   total = 0                  # named total so that the built-in sum() is not shadowed
   for entry in result:
       total += float(entry['Grade'])
       n += 1
   average = total / n

   # Calculate the difference of each grade from the average
   for entry in result:
       entry['Avgdiff'] = str(float(entry['Grade']) - average)

   # Write the updated content into another CSV file
   fp = open('second.csv', 'w')
   colnames = entry.keys()               # keys (column names); entry is the last row, all rows share the same keys
   fp.write(','.join(colnames) + '\n')   # write this as the first line with comma separated values
   for entry in result:                  # Go over each row
       vals = []
       for key in colnames:              # Write each column on this row
           vals.append(entry[key])       # extract values of entry, entry.values() is a short version of this
       fp.write(','.join(vals) + '\n')

   # Finished, close the file
   fp.close()

.. code:: python

   %cat second.csv

.. parsed-literal::
   :class: output

   Name,Surname,Grade,Avgdiff
   Han,Solo,80,-0.7999999999999972
   Luke,Skywalker,90,9.200000000000003
   Obi,Van Kenobi,88,7.200000000000003
   Leya,Skywalker,91,10.200000000000003
   Anakin,Skywalker,55,-25.799999999999997

Formatting Files
----------------

Sometimes readability is important for text files, especially if data is in a tabular form. For example, seeing all related data in a column start at the same position can improve readability significantly.
The following shows the unformatted and formatted versions of the same data side by side:

::

   Name,Surname,Grade,Avgdiff                 Name   , Surname    , Grade, Avgdiff
   Han,Solo,80,-0.7999999999999972            Han    , Solo       ,    80, -0.800
   Luke,Skywalker,90,9.200000000000003        Luke   , Skywalker  ,    90,  9.200
   Obi,Van Kenobi,88,7.200000000000003        Obi    , Van Kenobi ,    88,  7.200
   Leya,Skywalker,91,10.200000000000003       Leya   , Skywalker  ,    91, 10.200
   Anakin,Skywalker,55,-25.799999999999997    Anakin , Skywalker  ,    55,-25.800

In order to achieve this, you can use the ``format()`` method of a template string as in ``'{:10}, {:20}, {:3d}, {:7.3f}'.format('Han', 'Solo', 80, -0.2)``. Each ``{}`` in the template matches a data value in the arguments. The value after ``:`` denotes the (minimum) width of the data. If a string fits in a smaller number of characters, spaces are inserted on the right to make it have exactly the given size (left-aligned). For integers, the number is followed by a ``d`` to format it as a decimal value space-padded on the left (right-aligned). For floating point values, the width can be followed by a ``.`` and another number and an ``f``. The second number denotes the number of digits in the fraction, ``f`` marks this value as a float, and the fraction part is rounded to the given number of digits.

The detailed description of ``format()`` is out of the scope of this course and the document. For a detailed description, please refer to the Python reference manuals.

Let us rewrite the output part of the code using formatted output:

.. code:: python

   template = '{:10}, {:20}, {:5d}, {:7.3f}\n'
   fp = open('third.csv', 'w')
   colnames = entry.keys()    # this returns the keys of the CSV file
   fp.write('{:10}, {:20}, {:5}, {:7}\n'.format(*colnames) )  # write this as the first line with comma separated values
   for entry in result:
       # convert strings to numbers to respect number formatting
       fp.write(template.format(entry['Name'], entry['Surname'],
                                int(entry['Grade']), float(entry['Avgdiff'])))
   fp.close()

.. code:: python

   # Let us display the content of the file after formatting:
   %cat third.csv

.. parsed-literal::
   :class: output

   Name      , Surname             , Grade, Avgdiff
   Han       , Solo                ,    80,  -0.800
   Luke      , Skywalker           ,    90,   9.200
   Obi       , Van Kenobi          ,    88,   7.200
   Leya      , Skywalker           ,    91,  10.200
   Anakin    , Skywalker           ,    55, -25.800

Binary Files
------------

So far, we have only looked at text files where all values are represented as human readable text and all numerical values are represented as decimal strings. However, if you remember our early chapters, computers do not store and process numbers as decimal digit sequences. They store variables in binary formats like Two’s Complement and the IEEE 754 floating point standard.

In order to process, read and write decimal data in text, programming languages and libraries convert data. Even though you won’t notice the time spent in conversion for small inputs, if you read 10 million numbers, you start spending a significant amount of CPU time converting data.

Binary files, on the other hand, store numbers as they are stored in the computer’s memory. They are still sequences of bytes, but in a more structured way. For example, a 4-byte integer is kept as a sequence of 4 bytes, each byte being a part of the number in Two’s Complement form. Reading a binary file is simply copying data to memory: either no conversion is performed or only the order of bytes is changed.

A floating point number ``0`` takes 1 byte in a text file, but the number ``3.1415926535897932384626433832795028`` takes 36 bytes. In a binary file, the size of a number is fixed by its IEEE 754 format, regardless of its value: both ``0`` and :math:`\pi` are stored in 4 bytes for single precision and 8 bytes for double precision.

Keeping values in binary files has the following advantages:

1. It is more compact: Data occupies less space in the file.
2. No decimal to binary conversion is required. It is more efficient in terms of CPU usage.
3. Since sizes are fixed, randomly jumping to a location and reading relevant data is possible. In a text file, you have to start from the beginning and read all lines up to the relevant data. This kind of usage is a more advanced case and harder to understand for beginners.

On the other hand, using text files has the following advantages:

1. Files are human readable and editable. The user can change data using a standard editor. For binary files, special software should be used.
2. The file format is more flexible: using ``variablename: value`` patterns in the file, data can be stored in any order. This is why text files are often used as configuration files.

Most of the special formats with ``.exe``, ``.xls``, ``.zip``, ``.pdf`` extensions are binary file formats.

**Note:** Binary files are kept out of scope of this book. The following paragraphs give a couple of pointers for curious readers.

In order to use binary files:

1. You need to add the ``'b'`` character to the second argument of the ``open()`` function as: ``open('test.bin','rb')`` or ``open('test.bin','wb')``.
2. Binary I/O requires ``bytes`` typed values instead of ``str`` typed values. ``bytes`` is a sequence of bytes. Unlike the characters of a ``str``, the elements of a ``bytes`` value are not necessarily printable.
3. Python has the ``struct`` module for converting values into ``bytes``. ``struct.pack(format, values)`` converts values into ``bytes``. This conversion is computationally much cheaper than decimal to binary conversion.
4. ``struct.unpack(format, bytesval)`` can be used to convert a ``bytes`` value into Python values. It is much cheaper than binary to decimal conversion.
5. ``read()``, ``write()`` can be used as usual. In ``read(nbytes)``, the data size should be given. ``struct.calcsize(format)`` can be used to calculate the data size from the format.

The following is an example of binary I/O.
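
Before reading it, it may help to try the ``struct`` calls from the list above on their own, without any file involved. The sketch below packs one integer and one double precision value and unpacks them again; the sample values and variable names are ours, chosen only for illustration:

.. code:: python

   import struct

   packed_int = struct.pack('i', 4)      # one integer -> bytes
   packed_dbl = struct.pack('d', 2.5)    # one double precision float -> bytes

   # The length of the packed data matches what calcsize() reports for the format
   print(len(packed_int), struct.calcsize('i'))
   print(len(packed_dbl), struct.calcsize('d'))

   (n,) = struct.unpack('i', packed_int)   # unpack always returns a tuple
   (x,) = struct.unpack('d', packed_dbl)
   print(n, x)

The values survive the round trip unchanged, and ``calcsize()`` gives exactly the number of bytes that a later ``read()`` call has to request for each format.
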
Assume the binary file contains an integer :math:`N`, the number of points, followed by :math:`2 \times N` floating point values. Let us write and then read this data:

.. code:: python

   import struct

   points = [(1,1), (2.5, 3.4), (5.4,3.3), (2.2, 1.121)]

   # 1- Open and write the binary file
   fp = open("points.bin", "wb")
   fp.write(struct.pack('i', len(points)))  # 'i' denotes a single integer value is converted into bytes
   for (x,y) in points:
       fp.write(struct.pack('dd', x, y))    # 'dd' denotes two double precision floating point values are converted into bytes
   fp.close()

   # 2- Open and read the binary file
   fp = open("points.bin", "rb")            # open same file for reading
   content = fp.read(struct.calcsize('i'))  # read binary data with length sizeof integer bytes
   (n,) = struct.unpack('i', content)       # unpack returns a tuple, 1tuple in this case
   newpoints = []
   for i in range(n):                       # n times
       content = fp.read(struct.calcsize('dd'))
       (x,y) = struct.unpack('dd', content) # read two double precision floats
       newpoints += [(x,y)]                 # append value at the end
   fp.close()

   # 3- Print the read and converted values
   print("The read & converted points are:", newpoints)
   print("This is what binary data looks like:")
   fp = open("points.bin", "rb")
   print(fp.read())
   fp.close()

.. parsed-literal::
   :class: output

   The read & converted points are: [(1.0, 1.0), (2.5, 3.4), (5.4, 3.3), (2.2, 1.121)]
   This is what binary data looks like:
   b'\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x04@333333\x0b@\x9a\x99\x99\x99\x99\x99\x15@ffffff\n@\x9a\x99\x99\x99\x99\x99\x01@V\x0e-\xb2\x9d\xef\xf1?'

Note on Files, Directory Organization and Paths
-----------------------------------------------

Files are organized under directories so that you can put relevant files under the same category together. Operating systems provide a **filesystem hierarchy** consisting of directories (some prefer the word **folder** instead) and regular files. Directories can be arbitrarily nested.
You may need to traverse *N* levels of directories to find your file. For example, your program can be in the ``Homeworks`` directory under the ``Desktop`` directory under the ``user`` directory under the ``Desktop Users`` directory under the root directory (``/``), which is the topmost level on your filesystem. In MS Windows operating systems, the top-level directory and the separator are both written with a backslash (``\``). However, ``/`` works in Python on Windows too.

In order to address a file, we use a **path**, which is a sequence of directory names separated by ``/`` and ending with the name of the file. For example, ``"/Desktop Users/user/Desktop/Homeworks/homework1.py"`` is a path for the file named ``"homework1.py"``.

A path can be either **full** (absolute) or **relative**. In the former case, it starts with a slash (``/``). In the relative case, it is considered relative to the current working directory, i.e. the directory where you started your program. Full paths ignore your current directory whereas relative paths depend on it. For example, if you are currently in the ``Desktop`` directory, the path ``"Homeworks/homework1.py"`` will address the same file as the full path above.

List of File Class Member Functions
-----------------------------------

For completeness, the table below lists some of the most frequently used member functions of file objects. Assume **F** is a file. In the **Operation** column, anything in square brackets is optional; if you enter the optional content, do not type the square brackets.

+--------------------------------+-------------------------------------------+
| Operation                      | Result                                    |
+================================+===========================================+
| ``F.seek(offset[, whence=0])`` | Set file *F*\ ’s position, like stdio’s   |
|                                | ``fseek()``. If *whence* is ``0``,        |
|                                | *offset* is an absolute position; if      |
|                                | ``1``, it is relative to the current      |
|                                | position; if ``2``, it is relative to     |
|                                | the end of the file.                      |
+--------------------------------+-------------------------------------------+
| ``F.tell()``                   | Return file *F*\ ’s current position      |
|                                | (byte offset).                            |
+--------------------------------+-------------------------------------------+
| ``F.truncate([size])``         | Truncate *F*\ ’s size. If *size* is       |
|                                | present, *F* is truncated to (at most)    |
|                                | that size; otherwise *F* is truncated at  |
|                                | the current position (which remains       |
|                                | unchanged).                               |
+--------------------------------+-------------------------------------------+
| ``F.write(str)``               | Write string *str* to file *F*.           |
+--------------------------------+-------------------------------------------+
| ``F.writelines(list)``         | Write *list* of strings to file *F*. No   |
|                                | ``EOL`` characters are added.             |
+--------------------------------+-------------------------------------------+
| ``F.close()``                  | Close file *F*.                           |
+--------------------------------+-------------------------------------------+
| ``F.fileno()``                 | Get the file descriptor (fd) for file     |
|                                | *F*.                                      |
+--------------------------------+-------------------------------------------+
| ``F.flush()``                  | Flush file *F*\ ’s internal buffer.       |
+--------------------------------+-------------------------------------------+
| ``F.isatty()``                 | ``True`` if file *F* is connected to a    |
|                                | tty-like device, else ``False``.          |
+--------------------------------+-------------------------------------------+
| ``next(F)``                    | Return the next input line of file *F*,   |
|                                | or raise ``StopIteration`` when ``EOF``   |
|                                | is hit (``F.next()`` in Python 2).        |
+--------------------------------+-------------------------------------------+
| ``F.read([size])``             | Read at most *size* characters (bytes in  |
|                                | binary mode) from file *F* and return     |
|                                | them. If *size* is omitted, read to       |
|                                | ``EOF``.                                  |
+--------------------------------+-------------------------------------------+
| ``F.readline()``               | Read one entire line from file *F*. The   |
|                                | returned line has a trailing ``\n``,      |
|                                | except possibly at ``EOF``. Return ``""`` |
|                                | on ``EOF``.                               |
+--------------------------------+-------------------------------------------+
| ``F.readlines()``              | Read until ``EOF`` with ``readline()``    |
|                                | and return a list of lines read.          |
+--------------------------------+-------------------------------------------+

Important Concepts
------------------

We would like our readers to have grasped the following crucial concepts and keywords from this chapter:

- Sequential access. File access.
- Text files. Reading and writing text files. Parsing a text file.
- End of file, new line.
- Formatting files.
- Binary files and binary file access.

Further Reading
---------------

- String formatting in Python: https://docs.python.org/3.4/library/string.html#formatspec
- Working with binary data and files in Python: https://docs.python.org/3/library/binary.html
- Comma-Separated Values (CSV) file format: https://en.wikipedia.org/wiki/Comma-separated_values

Exercises
---------

- Write a function that reads a text file with the following format (ignore characters following ``#``):

::

   N                     # Number of students
                         # Empty line
   Name Surname          # First student
   M                     # Number of courses that the student has taken
   Coursename1: Grade    # Grade is a real number
   Coursename2: Grade
   Coursename3: Grade
   ...
   CoursenameM: Grade
                         # Empty line
   Name Surname          # Second student
   P                     # Number of courses that the student has taken
   Coursename1: Grade    # Grade is a real number
   Coursename2: Grade
   Coursename3: Grade
   ...
   CoursenameP: Grade
   ...
   ...
   ...
                         # Empty line
   Name Surname          # Last student
   Z                     # Number of courses that the student has taken
   Coursename1: Grade    # Grade is a real number
   Coursename2: Grade
   Coursename3: Grade
   ...
   CoursenameZ: Grade

- Write a function that writes a list of dictionaries that have the following format into a text file. You may choose to write the number of elements at the top of the file.

.. code:: python

   { "city": "Ankara",
     "plate code": "06",
     "max temperature (C)": 40,
     "min temperature (C)": -20,
     "population": 5700000 }

- Write a function that reads a list of dictionaries from a file that you have written in the previous question.

- Write a function that reads the text file given below, represents the same content in binary, saves it in a binary file and reads it back. ``3 0 3.4 2.1 5.1 3.2 EOLIST 1 1.5 2.0 2.5``