Lab 10 Pre Lab



Reading files

Using files as input is a much quicker and easier way to get information from the user, especially for large amounts of data. Rather than having the user enter everything by hand, we can read in the data from a file.

To open a file for reading, we use the following command:
        myInputFile = open("theFile.txt", "r")

This line of code does three things:

  1. It opens the file theFile.txt
  2. The file is opened for reading ("r") — as opposed to writing
  3. The opened file is assigned to the variable myInputFile




Methods of reading

Once we have opened a file and assigned it to a variable, we can use that variable to access the file. There are four different ways to read in a file.

  1. Read the entire file in as one enormous string (including newlines)
            myInputFile.read()
  2. Read the file in as a list of strings (each line being a single string)
            myInputFile.readlines()
  3. Read in a single line of the file
            myInputFile.readline()
  4. Iterate over the file using a for loop, reading in a line each loop
            for singleLine in myInputFile:
                # singleLine contains a line from the file

Often, if we want to extract or examine data from a file, the last option (using a for loop to iterate over the lines of the file) is the most obvious choice.
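
To compare the four approaches side by side, here is a minimal sketch. It assumes a small made-up file called "sample.txt", which the code creates first so that the example can run on its own; the file name, its contents, and the variable names are only for illustration. Notice that the file is reopened before each approach, since every read picks up where the previous one left off.

```python
# Create a small sample file so the example can run on its own.
# ("sample.txt" and its contents are made up for this sketch.)
sampleFile = open("sample.txt", "w")
sampleFile.write("first line\nsecond line\nthird line\n")
sampleFile.close()

# 1. read() returns the whole file as one string
myInputFile = open("sample.txt", "r")
wholeFile = myInputFile.read()
myInputFile.close()

# 2. readlines() returns a list with one string per line
myInputFile = open("sample.txt", "r")
allLines = myInputFile.readlines()
myInputFile.close()

# 3. readline() returns only the next line of the file
myInputFile = open("sample.txt", "r")
firstLine = myInputFile.readline()
myInputFile.close()

# 4. a for loop visits the lines one at a time
loopLines = []
myInputFile = open("sample.txt", "r")
for singleLine in myInputFile:
    loopLines.append(singleLine)
myInputFile.close()

print(allLines)    # ['first line\n', 'second line\n', 'third line\n']
print(loopLines)   # the same list as allLines
```

Every approach keeps the newline at the end of each line, which is why the examples that follow clean up each line before using it.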




Poem example

Below, you can see an example where we read in from a file, printing only those lines that are exactly 36 characters long.

    inputFile = open("road.txt") # Robert Frost's poem

    for line in inputFile:
        line = line.strip()      # remove the newline (and any other leading/trailing whitespace)
        if len(line) == 36:      # choose the lines to print
            print(line)

    inputFile.close()

The file "road.txt" contains the poem "The Road Not Taken" by Robert Frost, and the code above prints out only those lines that contain exactly 36 characters. The output looks like this:

    Two roads diverged in a yellow wood,
    To where it bent in the undergrowth;
    And having perhaps the better claim,
    Though as for that the passing there
    Had worn them really about the same,
    In leaves no step had trodden black.
    Yet knowing how way leads on to way,
    Two roads diverged in a wood, and I—




String manipulation

This is fine, but often we want to look at the contents of a line, and make a decision based on that, rather than on something trivial like the line length.

For example, we may have a file that contains information about our employees and how many hours they worked this week. Using this information, we want to be able to determine which employees are full-time (work 30 hours or more) and which are part-time.

If we know the format of the file we are reading in, we can take advantage of the split() function to assign each token in a line to its own variable. (A token is a sequence of characters. We don't call it a "word" because it may be made up of numbers, letters, whitespace, or a combination of any of the three.)




Worker hours example

If we take a look at the totalHours.txt file shown below, we can see that each line is formatted the same: employee id, employee name, and the total hours worked that week. Since we know the format, we can directly assign each piece to a separate variable, and use those variables to help decide which employees are full-time.

123 Suzy 18.5
456 Brad 35.0
789 Jenn 39.5
101 Thom 28.6

Remember that all of these variables will be strings to start off — so if we want to use them as integers or floats, we will need to first cast them to be that type.

    workerHours = open("totalHours.txt")
    for line in workerHours:
        # directly assign each token to a variable
        empID, name, hours = line.split()
        # remember to cast to another type if needed
        if float(hours) >= 30:
            print(name, "is a full-time employee")
        else:
            print(name, "is a part-time worker")
    
    # don't forget to close the file!
    workerHours.close()

That code and the totalHours.txt file will give us the following output:

    Suzy is a part-time worker
    Brad is a full-time employee
    Jenn is a full-time employee
    Thom is a part-time worker




split() example

By default, the split() function uses all whitespace (spaces, newlines, tabs, etc.) as the delimiter. The delimiter is the boundary between tokens when the string is split up. However, we can give it a specific character (or characters) to split on. Here's an example from class:

    nonsense = "nutty otters making lattes"
    nonsense = nonsense.split("tt")
    print(nonsense)
    # which will output this list of strings:
    # ['nu', 'y o', 'ers making la', 'es']

This is a bit of a silly example. Normally, when we choose to split on something that isn't whitespace, we are using some other sort of separator character. Commas, semicolons, and underscores are all common choices, as can be seen in the example code below:

    courseInfo = "CMSC_201_Fall_2016_Sec_01"
    infoList = courseInfo.split("_")
    print(infoList)
    # which will output this list of strings:
    # ['CMSC', '201', 'Fall', '2016', 'Sec', '01']
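
A comma works the same way. The sketch below uses a made-up record (a name, a quiz score, and an exam score) to show splitting on a comma and then casting the pieces. One thing to watch for: when split() is given a specific delimiter, it no longer removes the newline at the end of the line, so this sketch calls strip() first.

```python
# made-up record for illustration: name, quiz score, exam score
record = "Suzy,88,92.5\n"

# strip() removes the trailing newline, then split(",") breaks on commas
name, quiz, exam = record.strip().split(",")

# the pieces are still strings, so cast them before doing math
average = (int(quiz) + float(exam)) / 2
print(name, "has an average of", average)
# which will output:
# Suzy has an average of 90.25
```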




String clean up

When we use the split() function with no parameters, it splits on whitespace. This means that it automatically removes any trailing whitespace (like a newline character) from the end of the string; any leading whitespace is also removed from the start of the string.

If we simply want to remove trailing and leading whitespace, and don't need to use the split() function, we can use the strip() function instead. It removes all of the whitespace from the start and end of a single string, but leaves all of the interior whitespace intact.




strip() and removing whitespace

The code below shows the difference between the split() and strip() functions, and how they behave on a string. (We've printed out underscores on either side so you can "see" the exterior whitespace more easily.)

    ride = "\tMerry go\t round\n\n"
    print("Basic: _" + ride + "_")
    print("Stripped: _" + ride.strip() + "_")
    print("Split:", ride.split() )

This outputs:

    Basic: _       Merry go         round
    
    _
    Stripped: _Merry go        round_
    Split: ['Merry', 'go', 'round']

Notice that the strip() function left the interior tab character alone, but that it removed the tab character from the front, and both of the newline characters from the end. The split() function split the string into tokens by removing the interior whitespace, and it also removed all of the leading and trailing whitespace.





Writing files

Printing your output to a file instead of the terminal is a great way to store information for future use. It also helps when there is so much output that it would overwhelm the user on screen.

To open a file for writing, we use the following command:
        myOutputFile = open("theFile.txt", "w")

This line of code does three things:

  1. It opens the file theFile.txt
  2. The file is opened for writing ("w") — as opposed to reading
  3. The opened file is assigned to the variable myOutputFile

Notice that the only difference between using open() to open a file for writing versus for reading is the letter given to select the access mode: "w" for write, and "r" for read.

If the file we're opening for writing doesn't already exist, Python will create it for us, and the new file will start out blank. If an already existing file is opened for writing, the contents of the file are wiped. In other words, the information the file used to contain is deleted.
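
The wiping behavior can be seen in a minimal sketch like the one below. The file name "notes.txt" is made up for this example, and the write() function it uses to put some text in the file is covered in the next section.

```python
# "notes.txt" is a made-up file name for this sketch
myOutputFile = open("notes.txt", "w")
myOutputFile.write("old contents\n")
myOutputFile.close()

# opening the same file for writing again wipes it right away,
# even though nothing new is written this time
myOutputFile = open("notes.txt", "w")
myOutputFile.close()

# read the file back in to see what is left
myInputFile = open("notes.txt", "r")
contents = myInputFile.read()
myInputFile.close()

print(len(contents))   # prints 0, since the file is now empty
```

Because of this, it is worth double-checking the file name before opening anything for writing.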




Using the write() function

Once we have opened a file and assigned it to a variable, we can use that variable to write to the file, using the appropriately named write() function. Although using write() may seem similar to print(), the two actually work and behave very differently.

The main difference is that write() takes only a single string as a parameter. Several strings are fine as long as they are concatenated together (since Python evaluates the concatenation to a single string), but write() cannot handle numbers, or multiple items separated by commas.

Remember that when non-string variables are concatenated onto a string, they must first be cast to a string.

    # this will NOT work
    myOutputFile.write(name, "is now", age)
    # this will work
    myOutputFile.write(name + " is now " + str(age))

The other difference is that write() does not automatically add a newline to the end of what it writes. Instead, you must include the escape sequence for a newline ("\n") at the end of the string you are writing. You can add the newline character directly to the end, or you can concatenate it on as a separate "piece" when writing to the file.

    # directly at the end of the string
    myOutputFile.write("Line one\n")
    # concatenated to the end of the string
    myOutputFile.write("Line " + str(2) + "\n")
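
Putting both differences together, here is a minimal sketch that writes one line per person and then reads the file back in to check the result. The lists and the file name "ages.txt" are made up for this example.

```python
# made-up data for illustration
names = ["Suzy", "Brad", "Jenn"]
ages  = [19, 21, 20]

myOutputFile = open("ages.txt", "w")   # "ages.txt" is a made-up file name
for i in range(len(names)):
    # one single string: the number is cast, and the newline is added by hand
    myOutputFile.write(names[i] + " is now " + str(ages[i]) + "\n")
myOutputFile.close()

# read the file back in to check what was written
myInputFile = open("ages.txt", "r")
savedText = myInputFile.read()
myInputFile.close()
print(savedText)
```

Without the "\n" on the end of each string, all three sentences would end up on one long line in the file.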




Using multiple files

Python allows us to have more than one file open at a time. Each file can be opened for reading, writing, or appending, and the files don't all need to use the same mode: one can be opened for write while another is opened for read. This makes it very easy to write important information from one file directly to another!

For example, if we wanted to read in a file and save a copy of it, but with the copy in all capital letters, the following code would accomplish that:

    inputFile  = open("road.txt", "r")
    outputFile = open("ROAD.txt", "w")
    for line in inputFile:          # read in the input
        capsLine = line.upper()     # convert to all caps
        outputFile.write(capsLine)  # write out the output
    inputFile.close()
    outputFile.close()

If we take a look at the new file using the "emacs" command from the terminal, we can see that everything has been capitalized now:

    bash-4.1$ emacs ROAD.txt
    TWO ROADS DIVERGED IN A YELLOW WOOD,
    AND SORRY I COULD NOT TRAVEL BOTH
    [and so on]
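
The same pattern works with the worker hours data from earlier. The sketch below reads totalHours.txt and writes only the names of the full-time employees to a second file. The output file name "fullTime.txt" is made up for this example, and the code creates totalHours.txt first (with the same four lines shown earlier) so that it can run on its own.

```python
# create totalHours.txt with the same data as the worker hours example
hoursFile = open("totalHours.txt", "w")
hoursFile.write("123 Suzy 18.5\n456 Brad 35.0\n789 Jenn 39.5\n101 Thom 28.6\n")
hoursFile.close()

inputFile  = open("totalHours.txt", "r")
outputFile = open("fullTime.txt", "w")    # made-up output file name
for line in inputFile:
    empID, name, hours = line.split()     # split the line into its tokens
    if float(hours) >= 30:                # keep only full-time employees
        outputFile.write(name + "\n")     # write() needs the newline added
inputFile.close()
outputFile.close()

# read the new file back in to check the result
checkFile = open("fullTime.txt", "r")
fullTimers = checkFile.readlines()
checkFile.close()
print(fullTimers)
# which will output this list of strings:
# ['Brad\n', 'Jenn\n']
```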