The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.
You will be doing your work in a Jupyter notebook for this
assignment. You may choose to work on this assignment on a hosted
environment (e.g. tiger)
or on your own local installation of Jupyter and Python. You should use
Python 3.12 for your work. (Older versions may work, but your code will
be checked with Python 3.12.) To use tiger, use the credentials you
received. If you work remotely, make sure to download the .ipynb file to
turn in. If you choose to work locally, Anaconda or miniforge are
probably the easiest ways to install and manage Python. If you work
locally, you may launch Jupyter Lab either from the Navigator
application (Anaconda) or via the command line as jupyter-lab
or jupyter lab.
In this assignment, we will be working with texts from Project Gutenberg which is a library of eBooks which includes a lot of literature for which U.S. copyright has expired. In this assignment, you will process these files, count words, and convert some of the text. We will use German texts to demonstrate the benefits of Unicode encoding.
A template notebook is provided with a cell to help download the files given a URL. You will read this data, do some calculations, and write a new output file in a similar format. We will be working with three books, one from Spyri and two from Kafka. However, write your code so that it can be adapted to work with other texts as well. The texts are:
The assignment is due at 11:59pm on Tuesday, March 4.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a4.ipynb
. You
should not turn in the output files (Part 5) as your
notebook should contain the code to create it.
Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells. Do not use external libraries for reading, parsing, or writing the data files, but you may use the collections module for counting words.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
Use the download_text
function in the provided template notebook to download files. Then, write
a function that, given the filename, opens the local file and reads it,
returning the processed lines as a list. Project
Gutenberg includes metadata at the beginning of the text and license
information at the end of the text. The actual text is contained between
the lines
*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
and
*** END OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
where <TITLE>
is the name of the book. Read (and
discard) lines until you hit the START tag. Then, read in all
the lines up to but not including the END tag into a list of
strings. Remove any leading and trailing whitespace from the lines. You
will need to figure out how to find the START and END
tags and which lines should be included.
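The reading step described above might be sketched as follows. The function name read_book and the use of startswith to match the tags are assumptions, not requirements of the assignment:

```python
# Sketch of Part 1: keep only the stripped lines between the START
# and END tags. The name read_book is an assumption.
def read_book(filename):
    """Return the stripped text lines between the START and END tags."""
    lines = []
    in_text = False
    with open(filename, encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith("*** END OF THE PROJECT GUTENBERG EBOOK"):
                break
            if in_text:
                lines.append(stripped)
            elif stripped.startswith("*** START OF THE PROJECT GUTENBERG EBOOK"):
                in_text = True
    return lines
```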
Hint: strip will be useful to remove whitespace.

Next, write a function to loop through the list of lines to find the
set of all non-ASCII characters in the text. Note that you can check
whether a character is in the ASCII subset by checking whether its
integer value is less than 128 (recall ASCII mapped the first 128
characters to the values 0 through 127). Return this set of non-ASCII
characters. Test your function with the data from the books.
Hint: the ord function will be useful to obtain the integer value of a
character.

Next, write a function that creates a new version of the text where
the diacritics (seen in the return value from Part 2) are translated to
non-diacritic sequences that can be saved in ASCII. Here, we wish to
have the following translations (ä to ae, ö to oe, ü to ue, ß to ss)
and upper-case variants (Ä to Ae, Ö to Oe, and Ü to Ue). Use a
programmatic way to derive the upper-case variants from the lowercase
conversions. Your output should be a new list with the converted lines
of text. You may not use a series of if or elif statements for this
task.
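One if/elif-free approach is a translation table built with str.maketrans, with the capitalized entries derived programmatically from the lowercase ones. Note that ß is skipped when building the upper-case table because "ß".upper() is the two-character "SS", which cannot serve as a maketrans key. The names below are assumptions:

```python
# Sketch of Part 3: translate diacritics without any if/elif chains.
LOWER = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
# Derive the upper-case variants programmatically; skip ß because its
# upper-case form "SS" is two characters, not a valid maketrans key.
UPPER = {k.upper(): v.capitalize() for k, v in LOWER.items() if k != "ß"}
TRANSLATIONS = str.maketrans({**LOWER, **UPPER})

def convert_lines(lines):
    """Return a new list with diacritics replaced by ASCII sequences."""
    return [line.translate(TRANSLATIONS) for line in lines]
```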
Next, write a function that computes how often each of the German
personal pronouns (“ich”, “du”, “er”, “sie”, “es”, “wir”, “ihr”) occurs
and returns those counts as a dictionary. To do this, first remove any
punctuation from the strings and then split each line into individual
words, converting them to all lowercase characters. Then, compute the
frequency of each word by counting each word's occurrences. I recommend
using the Counter
from the collections
module
to do this. Then, count the number of each pronoun used and the total
number of pronouns. To make sure you are counting correctly, for Heidi,
ich is used 456 times, du is used 391 times, and there are 3336 total
pronoun usages. Depending on how you replace the punctuation (removing
entirely versus replacing with spaces) or if you convert the lines
first, your counts may vary.
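The counting step above might be sketched as follows. The function name count_pronouns is an assumption; here punctuation is replaced with spaces before splitting, which, as noted, can shift the counts slightly compared to other choices:

```python
import string
from collections import Counter

PRONOUNS = ("ich", "du", "er", "sie", "es", "wir", "ihr")

def count_pronouns(lines):
    """Return a dict mapping each German personal pronoun to its count."""
    # Replace every punctuation character with a space before splitting.
    strip_punct = str.maketrans({p: " " for p in string.punctuation})
    words = Counter()
    for line in lines:
        words.update(line.translate(strip_punct).lower().split())
    return {p: words[p] for p in PRONOUNS}
```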
Hints: string.punctuation gives a string with common punctuation marks
that should be removed. The split function should help break a line
into words. The update method of Counter will be useful.

Write the converted text (from Part 3) to a file. In addition, put
your name at the beginning of the file as the converter, then a line
marking the start of the text (*** START OF THE EBOOK ***),
the converted text, and then a line marking the end of the text
(*** END OF THE EBOOK ***). Finally, use your function from
Part 4 to compute the number of uses of each pronoun, as well as the
total number of pronouns. The statistics need to be right-aligned so your
output should look like:
From Project Gutenberg (www.gutenberg.org)
Converted by: Iamma Huskie
*** START OF THE EBOOK ***
...
*** END OF THE EBOOK ***
ich usage:        456
du usage:         391
er usage:         567
sie usage:        589
es usage:        1110
wir usage:         75
ihr usage:        148
Total pronouns:  3336
The name of the file should be the original filename with -converted
added. For example, pg7500-converted.txt.
Hints: Use print function calls to write a few data items to stdout
first; then, when you are satisfied the format is correct, write all of
the data items to a file via the file parameter of the print function.
Use a with statement to make sure all data is written to the file (or
make sure to call close). Pass the w flag to open to be able to write
to a file. You can view the output file from a notebook cell using the
!cat command:

!cat pg7500-converted.txt
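The writing step might be sketched as below. To keep the example self-contained it takes the already-converted lines and the pronoun counts as arguments; the function name write_converted, the default converter name, and the exact column widths are assumptions:

```python
def write_converted(filename, converted_lines, counts, name="Iamma Huskie"):
    """Write the converted text plus right-aligned pronoun statistics."""
    out_name = filename.replace(".txt", "-converted.txt")
    total = sum(counts.values())
    with open(out_name, "w") as f:
        print("From Project Gutenberg (www.gutenberg.org)", file=f)
        print(f"Converted by: {name}", file=f)
        print("*** START OF THE EBOOK ***", file=f)
        for line in converted_lines:
            print(line, file=f)
        print("*** END OF THE EBOOK ***", file=f)
        # Pad the labels to a fixed width and right-align the numbers.
        for pronoun, count in counts.items():
            print(f"{pronoun + ' usage:':<16}{count:>5}", file=f)
        print(f"{'Total pronouns:':<16}{total:>5}", file=f)
    return out_name
```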
Now, we wish to test the differences between two texts, specifically to investigate the hypothesis that different authors use pronouns differently. For this part, write a single function that takes two filenames and uses the functions written in Parts 1 and 4 to compute the contingency table showing the number of uses of “ich” and “du” along with the sum of the counts for those two pronouns. Right-align the results. Then, compute the chi-squared test for the ich and du usages, and test it on Die Verwandlung vs. Der Prozess (same author) and Heidi vs. Der Prozess (different authors). Report the p-value. A p-value < 0.05 indicates the books are likely written by different authors. Use the function scipy.stats.chi2_contingency to do this calculation. Note that the function does not need the totals, only the upper-left 2x2 square. Sample results:
>>> compare_texts('pg69327.txt', 'pg22367.txt')
                # Ich   # Du  Total
pg69327.txt       844    125    969
pg22367.txt        61      6     67
Total             905    131   1036
p-value: 0.4535378271265039
>>> compare_texts('pg7500.txt', 'pg69327.txt')
                # Ich   # Du  Total
pg7500.txt        456    391    847
pg69327.txt       844    125    969
Total            1300    516   1816
p-value: 4.755567980204363e-55
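The table construction and chi-squared step might be sketched as below. To keep the example self-contained it takes the pronoun-count dictionaries (as produced in Part 4) rather than filenames, so the file-reading plumbing from Parts 1 and 4 is omitted; the function name compare_counts and the column widths are assumptions:

```python
from scipy.stats import chi2_contingency

def compare_counts(name1, counts1, name2, counts2):
    """Print the contingency table and return the chi-squared p-value."""
    rows = [[counts1["ich"], counts1["du"]],
            [counts2["ich"], counts2["du"]]]
    print(f"{'':<14}{'# Ich':>7}{'# Du':>7}{'Total':>7}")
    for name, (ich, du) in zip((name1, name2), rows):
        print(f"{name:<14}{ich:>7}{du:>7}{ich + du:>7}")
    ich_total = rows[0][0] + rows[1][0]
    du_total = rows[0][1] + rows[1][1]
    print(f"{'Total':<14}{ich_total:>7}{du_total:>7}{ich_total + du_total:>7}")
    # chi2_contingency needs only the 2x2 counts, not the row/column totals.
    chi2, pvalue, dof, expected = chi2_contingency(rows)
    print("p-value:", pvalue)
    return pvalue
```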
*** lines in Part 5 (5 pts)