Assignment 4

Goals

The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.10 for your work, but versions 3.8 and 3.9 should also work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will be working with texts from Project Gutenberg, a library of eBooks that includes a lot of literature for which U.S. copyright has expired. In this assignment, you will process these files, count words, and convert some of the text. We will use German texts to demonstrate the benefits of Unicode encoding.

A template notebook is provided with a cell to help download the files given a URL. You will read this data, do some calculations, and write a new output file in a similar format. We will be working with two books, the second of which is written in two parts. However, write your code so that it can be adapted to work with other texts as well. The texts are:

  • Siddhartha (pg2499.txt)
  • Heidi, Part 1 (pg7500.txt)
  • Heidi, Part 2 (pg7512.txt)

Due Date

The assignment is due at 11:59pm on Tuesday, October 11.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a4.ipynb. You should not turn in the output files (Part 5) as your notebook should contain the code to create it.

Details

Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells. Do not use external libraries for reading, parsing, or writing the data files, but you may use the collections module for counting words.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Read Data (10 pts)

Use the download_text function in the provided template notebook to download files. Then, write a function that, given a filename, opens and reads the downloaded local file, returning the processed lines as a list. Project Gutenberg includes metadata at the beginning of the text and license information at the end of the text. The actual text is contained between the lines

*** START OF <THE|THIS> PROJECT GUTENBERG EBOOK <TITLE> ***

and

*** END OF <THE|THIS> PROJECT GUTENBERG EBOOK <TITLE> ***

where <THE|THIS> indicates that either the word THE or the word THIS is present, and <TITLE> is the name of the book. Read (and discard) lines until you hit the START tag. Then, read in all the lines up to but not including the END tag into a list of strings. Remove any leading and trailing whitespace from the lines. You will need to figure out how to find the START and END tags and which lines should be included.

Hints
  • You may check for the tags using a regular expression, but Python’s string methods should also work.
  • If you pass an iterator to a for loop, it will loop through the remaining items.
  • strip will be useful to remove whitespace.
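
The iterator hint above can be sketched as follows. This is a minimal sketch, not the required solution; the function name and the exact tag checks are illustrative, and it assumes the file has already been downloaded locally.

```python
# Sketch: read the text between the Gutenberg START and END tags.
def read_text(filename):
    text = []
    with open(filename, encoding='utf-8') as f:
        it = iter(f)  # a file object is already an iterator over its lines
        for line in it:
            if line.startswith('*** START OF'):
                break  # discard metadata up to and including the START tag
        for line in it:  # the same iterator resumes after the START tag
            if line.startswith('*** END OF'):
                break
            text.append(line.strip())  # drop leading/trailing whitespace
    return text
```

Because both loops share one iterator, the second loop picks up exactly where the first one stopped, so no line is read twice.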

2. Find Non-ASCII Characters (10 pts)

Next, write a function to loop through the list of lines to find the set of all non-ASCII characters in the text. Note that you can check whether a character is in the ASCII subset by checking whether its integer value is less than 128 (recall ASCII mapped the first 128 characters to the values 0 through 127). Return this set of non-ASCII characters. Test your function with the data from one of the books.

Hints
  • Remember that you will need to iterate through each character
  • The ord function will be useful to obtain the integer value of a character.
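
The check described above can be sketched as a set comprehension; the function name here is illustrative, not required.

```python
# Sketch: collect every character whose code point is 128 or higher.
def find_non_ascii(lines):
    return {ch for line in lines for ch in line if ord(ch) >= 128}
```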

3. Convert Diacritics (15 pts)

Next, write a function that creates a new version of the text where the diacritics (seen in the return value from Part 2) are translated to non-diacritics that could be saved in ASCII. Here, we wish to have the following translations (ä to ae, ö to oe, ü to ue, ß to ss) and upper-case variants (Ä to Ae, Ö to Oe, and Ü to Ue). Use a programmatic way to do the upper-case variants using the lowercase conversions. Your output should be a new list with the converted lines of text. You may not use a series of if or elif statements for this task.

Hints
  • Consider using a dictionary to map the conversions
  • Remember for lower/uppercase variation that you can merge two dictionaries
  • Comprehensions can be useful in applying the modifications to each character
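
Combining the hints, one possible sketch builds the uppercase entries from the lowercase dictionary and merges the two (ß is excluded from the uppercase derivation since it has no single-character uppercase form). The names are illustrative.

```python
# Sketch: translate German diacritics to ASCII-safe pairs.
lower = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}
# derive Ä -> Ae, Ö -> Oe, Ü -> Ue programmatically from the lowercase map
upper = {k.upper(): v.capitalize() for k, v in lower.items() if k != 'ß'}
trans = {**lower, **upper}  # merge the two dictionaries

def convert_lines(lines):
    # dict.get(ch, ch) leaves characters without a translation unchanged
    return [''.join(trans.get(ch, ch) for ch in line) for line in lines]
```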

4. Compute Word Frequencies (15 pts)

Next, write a function that computes how many words appear infrequently (<= 4 times) and how many appear frequently (> 4 times). To do this, first remove any punctuation from the strings and then split each line into individual words, converting them to all lowercase characters (updated 2022-10-07). Then, compute the frequency of each word by counting each word's occurrences. I recommend using the Counter from the collections module to do this. Then, count the number of words that appear infrequently and frequently and return both. To make sure you are counting correctly, for Heidi, Part 1, the number of infrequent words is 4786 and the number of frequent words is 1101 (updated 2022-10-07). Depending on how you replace the punctuation (removing it entirely versus replacing it with spaces) or whether you convert the lines first, you may also see 4699 and 1097, or 4768 and 1102, respectively (updated 2022-10-10).

Hints
  • string.punctuation gives a string with common punctuation marks that should be removed.
  • The split function should help break a line into words
  • The update method of Counter will be useful
  • Remember that Python allows you to return multiple values by separating them with commas
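
The steps above can be sketched as follows. This sketch replaces punctuation with spaces (one of the variants mentioned in the text); the function name and the threshold parameter are illustrative.

```python
import string
from collections import Counter

# Sketch: count words appearing <= threshold times (infrequent) vs. more.
def count_frequencies(lines, threshold=4):
    # map every punctuation character to a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    counts = Counter()
    for line in lines:
        counts.update(line.translate(table).lower().split())
    infrequent = sum(1 for c in counts.values() if c <= threshold)
    frequent = sum(1 for c in counts.values() if c > threshold)
    return infrequent, frequent  # returned as a tuple
```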

5. Write Output (20 pts)

Write the converted text (from Part 3) to a file. In addition, put your name at the beginning of the file as the converter, then a line marking the start of the text (*** START OF THE EBOOK ***), the converted text, and then a line marking the end of the text (*** END OF THE EBOOK ***). Finally, use your function from Part 4 to compute the number of frequent and infrequent words, and also compute the total number of unique words. The statistics need to be right-aligned, so your output should look like:

From Project Gutenberg (www.gutenberg.org)
Converted by: Iamma Huskie

*** START OF THE EBOOK ***

...

*** END OF THE EBOOK ***

Number of infrequent words:       4786
Number of frequent words:         1101
Total unique words:               5887

The name of the file should be the original filename with -converted added. For example, pg7500-converted.txt.

Hints
  • Use print function calls to write a few data items to stdout first, then when you are satisfied the format is correct, write all of the data items to a file.
  • Remember the file keyword argument for the print function.
  • Use a with statement to make sure all data is written to the file (or make sure to call close).
  • Remember to pass the w flag to open to be able to write to a file.
  • Consult the Format Specification Mini-Language for the various flags
  • You can see the contents of the file you wrote in the notebook using the !cat command: !cat pg7500-converted.txt
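
Putting the hints together, the output step might be sketched like this. The function name, the converter argument, and the 10-character field width are assumptions chosen to resemble the sample layout, not requirements.

```python
# Sketch: write the converted lines and the statistics to a -converted file.
def write_converted(filename, lines, converter, infrequent, frequent, unique):
    out_name = filename.replace('.txt', '-converted.txt')
    with open(out_name, 'w', encoding='utf-8') as f:  # 'w' flag to write
        print('From Project Gutenberg (www.gutenberg.org)', file=f)
        print(f'Converted by: {converter}', file=f)
        print(file=f)
        print('*** START OF THE EBOOK ***', file=f)
        print(file=f)
        for line in lines:
            print(line, file=f)
        print(file=f)
        print('*** END OF THE EBOOK ***', file=f)
        print(file=f)
        # right-align each count in a 10-character field
        print(f'Number of infrequent words: {infrequent:>10}', file=f)
        print(f'Number of frequent words:   {frequent:>10}', file=f)
        print(f'Total unique words:         {unique:>10}', file=f)
    return out_name
```

The with statement closes the file and flushes all buffered data, so nothing extra is needed before inspecting the output with !cat.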

6. [CSCI 503 Only] Compare Texts (15 pts)

Now, we wish to test the differences between two texts, specifically to investigate the hypothesis that one text differs from the other in word reuse. For this part, write a single function that takes two filenames and uses the functions written in Parts 1 and 4 to compute the contingency table showing the number of infrequent and frequent words for each text along with their totals. Right-align the results. Then, compute the chi-squared test for the infrequent and frequent words, and test it on Heidi, Part 1 vs. Heidi, Part 2 (same author) and Heidi, Part 1 vs. Siddhartha (different authors). Report the p-value. A p-value < 0.05 would indicate the books were likely written by different authors, so we can see this test does not work well for our example. Use the function scipy.stats.chi2_contingency to do this calculation. Note that the function does not need the totals, only the upper-left 2x2 square. Sample results (updated 2022-10-10):

>>> compare_texts('pg7500.txt', 'pg7512.txt')

               # Infrequent   # Frequent      Total
  pg7500.txt           4786         1101       5887
  pg7512.txt           4278          852       5130
       Total           9064         1953      11017

p-value: 0.004429356727655681

>>> compare_texts('pg2499.txt', 'pg7500.txt')

               # Infrequent   # Frequent      Total
  pg2499.txt           4207          887       5094
  pg7500.txt           4786         1101       5887
       Total           8993         1988      10981

p-value: 0.08446516178458324
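
The chi-squared step can be sketched as below, assuming the per-text (infrequent, frequent) counts have already been computed with the Part 4 function; the function and variable names are illustrative, and scipy must be installed.

```python
from scipy.stats import chi2_contingency

# Sketch: p-value for a 2x2 contingency table of word-frequency counts.
def chi2_pvalue(counts_a, counts_b):
    # only the upper-left 2x2 square is passed -- no row/column totals
    table = [list(counts_a), list(counts_b)]
    chi2, p, dof, expected = chi2_contingency(table)
    return p
```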

Extra Credit

  • CSCI 490 Students may complete Part 6 for extra credit
  • All students may add the book name into the *** lines in Part 5 (5 pts)
  • All students may add the book name to the contingency tables in Part 6 (5 pts)