Assignment 4

Goals

The goal of this assignment is to work with files, iteration, encodings, strings, and string formatting in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.10 for your work, but versions 3.8 and 3.9 should also work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will be working with texts from Project Gutenberg, a library of eBooks that includes a lot of literature for which U.S. copyright has expired. You will process these files, count words, and convert some of the text. We will use Portuguese texts to examine the effects of different encodings.

A template notebook is provided with a cell to help download the files given a URL. You will read this data, do some processing, and write a new output file in a similar format. You should write your code so that it can be adapted to work with other texts, but make sure that it runs for the following three texts. Note that the encoding is specified for each.

Due Date

The assignment is due at 11:59pm on Wednesday, March 8.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a4.ipynb. You should not turn in the output files, as your notebook should contain the code to create them.

Details

Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells. Do not use external libraries for reading, parsing, or writing the data files, but you may use the regular expressions module and string methods as necessary. CSCI 490 students do not need to complete Part 4; it is extra credit for them.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Read Data (10 pts)

Use the download_text function in the provided template notebook to download files. Then, write a function that, given the filename and encoding, opens the downloaded local file and reads it, returning the processed lines as a list. Project Gutenberg includes metadata at the beginning of the text and license information at the end of the text. The actual text is contained between the lines

*** START OF THIS PROJECT GUTENBERG EBOOK <TITLE> ***

and

*** END OF THIS PROJECT GUTENBERG EBOOK <TITLE> ***

where <TITLE> is the name of the book. Read (and discard) lines until you hit the START tag. Then, read in all the lines up to but not including the END tag into a list of strings. Remove any leading and trailing whitespace from the lines. You will need to figure out how to find the START and END tags and which lines should be included.

Hints
  • Note that the first text uses iso-8859-1 encoding while the others use utf-8.
  • You may check for the tags using a regular expression, but Python’s string methods should also work.
  • If you pass an iterator to a for loop, it will loop through the remaining items.
  • strip will be useful to remove whitespace.
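
The iterator hint can be illustrated with a small sketch. This is only one possible approach, not the required solution: the names extract_body and read_text are made up for this example, and it assumes that the tag lines begin with exactly the START and END prefixes shown above.

```python
START_MARK = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARK = "*** END OF THIS PROJECT GUTENBERG EBOOK"


def extract_body(lines):
    """Return the stripped lines strictly between the START and END tags."""
    it = iter(lines)
    # Read and discard everything up to and including the START tag.
    for line in it:
        if line.strip().startswith(START_MARK):
            break
    body = []
    # The same iterator resumes with the line after the START tag.
    for line in it:
        stripped = line.strip()
        if stripped.startswith(END_MARK):
            break
        body.append(stripped)
    return body


def read_text(filename, encoding):
    """Open a local file with the given encoding and extract its body."""
    with open(filename, encoding=encoding) as f:
        return extract_body(f)


# A tiny in-memory stand-in for a downloaded file.
sample = [
    "Title: Example\n",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n",
    "\n",
    "  Ol\u00e1, cora\u00e7\u00e3o  \n",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n",
    "License text\n",
]
print(extract_body(sample))
```

Because a file object is itself an iterator over its lines, read_text can pass it straight to extract_body. A call such as read_text('17610-8.txt', 'iso-8859-1') would then return the body of the first book, assuming the file has already been downloaded.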

2. Find Non-ASCII Characters (10 pts)

Next, write a function to loop through the list of lines to find the set of all unique non-ASCII characters in the text. Note that you can check whether a character is in the ASCII subset by checking whether its integer value is less than 128 (recall that ASCII maps its 128 characters to the values 0 through 127). Return this set of non-ASCII characters. Test your function with the data from one of the books. For the first book (17610-8.txt), you should find 22 unique non-ASCII characters.

Hints
  • Remember that you will need to iterate through each character
  • The ord function will be useful to obtain the integer value of a character.
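
As a minimal sketch of the ord check (the function name non_ascii_chars is an assumption made for this example), a set comprehension can iterate over every character of every line:

```python
def non_ascii_chars(lines):
    """Collect every character whose code point is 128 or higher."""
    return {ch for line in lines for ch in line if ord(ch) >= 128}


# The escapes below spell "Olá, coração".
found = non_ascii_chars(["Ol\u00e1, cora\u00e7\u00e3o", "plain ASCII text"])
print(sorted(found))
```

Using a set means repeated characters are stored only once, so len(found) gives the number of unique non-ASCII characters directly.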

3. Convert Quotes (10 pts)

In the works by Garrett, quotation marks that indicate the start and end of a quote are used (‘, ’, “, ”). In addition, two of the books use the « and » symbols. We wish to change all of these quotation marks to the corresponding non-stylized versions ' and ". Convert the « and » symbols to double quotes. Write a method to convert these quotation marks for the book contents read in Part 1.

Hints
  • Consider using a dictionary to map the conversions
  • Comprehensions can be useful in applying the modifications to each character
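
The two hints above might be combined as follows. This is a sketch under assumed names (QUOTE_MAP and convert_quotes are not part of the template notebook); the characters are written as Unicode escapes so that the mapping is unambiguous.

```python
# Map each stylized quotation mark to its plain ASCII counterpart.
QUOTE_MAP = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u00ab": '"',  # left-pointing guillemet
    "\u00bb": '"',  # right-pointing guillemet
}


def convert_quotes(lines):
    """Return new lines with stylized quotes replaced; other characters pass through."""
    return ["".join(QUOTE_MAP.get(ch, ch) for ch in line) for line in lines]


print(convert_quotes(["\u00abOl\u00e1\u00bb \u2018a\u2019 \u201cb\u201d"]))
```

The dict.get(ch, ch) call returns the character itself when it is not in the mapping, so only the six quotation marks change. The built-in str.translate method, with a table built by str.maketrans(QUOTE_MAP), would be an alternative to the inner comprehension.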

4. [CSCI 503 Only] Convert Underscores (Italics) (15 pts)

From the output from Part 3, we now wish to change all the _italicized phrases_ to use the HTML tags <i> and </i>. Here, any time we see an underscore preceded by a non-whitespace character and followed by a word boundary, we want to change it to </i>, and if that is not true but it is preceded by a word boundary and followed by a non-whitespace character, we want to change it to <i>. Write a method to convert these character sequences for the book contents read in Part 1. If you count the number of <i> and </i> substrings after conversion, they should be the same.

Hints
  • Use regular expressions.
  • The order you apply the regular expressions matters.
  • Remember to capture the text that comes before or after the word so you can use it in the substitution.
  • \b is a zero-width assertion that matches at a word boundary. It can be useful to test if you have an underscore before or after a word.
  • Check the character classes to find one for non-space characters
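
One way to read the two rules above as regular expressions is sketched below. The names CLOSE_RE, OPEN_RE, and convert_italics are assumptions for this example, and other patterns may work equally well. Note that in Python's re module the underscore itself counts as a word character, which is why \b matches between a space and an underscore but not between a letter and an underscore.

```python
import re

# Closing tag: a non-whitespace character, an underscore, then a word boundary.
CLOSE_RE = re.compile(r"(\S)_\b")
# Opening tag: a word boundary, an underscore, then a non-whitespace character.
OPEN_RE = re.compile(r"\b_(\S)")


def convert_italics(lines):
    """Replace _phrase_ markers with <i>phrase</i>, one line at a time."""
    converted = []
    for line in lines:
        # Apply the closing rule first, then the opening rule.
        line = CLOSE_RE.sub(r"\1</i>", line)
        line = OPEN_RE.sub(r"<i>\1", line)
        converted.append(line)
    return converted


print(convert_italics(["all the _italicized phrases_ here"]))
```

The captured character (\1) is written back in each substitution so that only the underscore is replaced. An underscore inside a word, as in snake_case, matches neither pattern and is left alone.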

5. Count Number of Headers and Roman Numerals (15 pts)

Because these are books of poetry, each poem is often identified with a title. In the books we have been looking at, these are indicated with lines that only have capital letters. Note that these lines could also indicate chapters or other headers. However, some roman numerals are also used to separate sections. Write a function to count the number of lines that are headers and the number of lines that are roman numerals in the text. A header is a line written in all capital letters that is not just a roman numeral. You can check whether a line is a roman numeral using the following regular expression from the Regular Expressions Cookbook that assumes you have removed all leading and trailing whitespace first:

r'^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$'

For the file 63438-0.txt, I found 110 headers and 105 roman numerals. However, these counts depend on the type of check you do. For example, I am assuming that numbers or punctuation are allowed, but that might also allow a header with just numbers.

Hints
  • Remember that there is a function to check the casing of Python strings
  • You only need to use the regular expression for this part. There is no need to modify it or understand exactly what it is doing
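
A rough sketch of the counting logic, using the regular expression exactly as given above, could look like this. The function name count_headers_and_numerals is an assumption, and the casing check shown (str.isupper) is just one choice; as noted, a different check may produce different counts.

```python
import re

# Roman numeral pattern quoted in the assignment text (expects stripped lines).
ROMAN_RE = re.compile(
    r'^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$'
)


def count_headers_and_numerals(lines):
    """Return (number of header lines, number of roman numeral lines)."""
    headers = numerals = 0
    for line in lines:
        if ROMAN_RE.match(line):
            numerals += 1
        elif line.isupper():
            # isupper() is True only if the line has at least one cased
            # character and all of its cased characters are uppercase.
            headers += 1
    return headers, numerals


demo = ["XIV", "A MINHA TERRA", "Some normal line", "", "I", "1834", "CANTO 2."]
print(count_headers_and_numerals(demo))  # (2, 2)
```

With this particular check, digits and punctuation are allowed in a header, but a line containing only digits (such as 1834) is not counted because it has no cased characters. Testing for a roman numeral first keeps lines like XIV out of the header count.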

6. Write Output (20 pts)

Write the converted text (from Part 3 or 4) to a file using the utf-8 encoding. (If you are in CSCI 503, this means applying the functions from both Parts 3 and 4.) In addition, put your name at the beginning of the file as the converter, then a line marking the start of the text (*** START OF THE EBOOK ***), the converted text, and then a line marking the end of the text (*** END OF THE EBOOK ***). Finally, use your function from Part 5 to write the number of headers and roman numerals. The statistics need to be right-aligned so your output should look like:

From Project Gutenberg (www.gutenberg.org)
Converted by: Iamma Huskie

*** START OF THE EBOOK ***

...

*** END OF THE EBOOK ***

Number of Headers:                110
Number of Roman Numerals:         105

The name of the file should be the original filename with -converted added. For example, 63438-0-converted.txt.

Hints
  • Use print function calls to write a few data items to stdout first, then when you are satisfied the format is correct, write all of the data items to a file.
  • Remember the file keyword argument for the print function.
  • Use a with statement to make sure all data is written to the file (or make sure to call close).
  • Remember to pass the w flag to open to be able to write to a file.
  • Consult the Format Specification Mini-Language for the various flags
  • You can see the contents of the file you wrote in the notebook by opening the file in Jupyter or by using the !cat command: !cat 63438-0-converted.txt
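
The hints above can be put together in a sketch like the following. The function name write_converted and its parameters are assumptions, as are the column widths (25 for the label and 12 for the number), which were chosen only to resemble the sample output.

```python
import os


def write_converted(filename, converter, lines, n_headers, n_numerals):
    """Write the converted text plus statistics; return the output filename."""
    base, ext = os.path.splitext(filename)
    out_name = f"{base}-converted{ext}"
    # The with statement closes the file, which flushes all written data.
    with open(out_name, "w", encoding="utf-8") as f:
        print("From Project Gutenberg (www.gutenberg.org)", file=f)
        print(f"Converted by: {converter}", file=f)
        print(file=f)
        print("*** START OF THE EBOOK ***", file=f)
        print(file=f)
        for line in lines:
            print(line, file=f)
        print(file=f)
        print("*** END OF THE EBOOK ***", file=f)
        print(file=f)
        # '<' left-aligns the label and '>' right-aligns the count.
        print(f"{'Number of Headers:':<25}{n_headers:>12}", file=f)
        print(f"{'Number of Roman Numerals:':<25}{n_numerals:>12}", file=f)
    return out_name
```

For example, write_converted('63438-0.txt', 'Iamma Huskie', lines, 110, 105) would create 63438-0-converted.txt in the same directory as the original file. Because both numbers are formatted with the same width, their last digits line up in the same column.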

Extra Credit

  • CSCI 490 Students may complete Part 4 for extra credit
  • All students may add the book name into the *** lines in Part 6 (5 pts)