The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.
You will be doing your work in a Jupyter notebook for this
assignment. You may choose to work on this assignment on a hosted
environment (e.g. tiger)
or on your own local installation of Jupyter and Python. You should use
Python 3.12 for your work. (Older versions may work, but your code will
be checked with Python 3.12.) To use tiger, use the credentials you
received. If you work remotely, make sure to download the .ipynb file to
turn in. If you choose to work locally, Anaconda or miniforge are
probably the easiest ways to install and manage Python. If you work
locally, you may launch Jupyter Lab either from the Navigator
application (Anaconda) or via the command line as jupyter-lab
or jupyter lab.
In this assignment, we will be working with texts from Project Gutenberg which is a library of eBooks which includes a lot of literature for which U.S. copyright has expired. In this assignment, you will process these files, count words, and convert some of the text. We will use German texts to demonstrate the benefits of Unicode encoding.
A template notebook is provided with a cell to help download the files given a URL. You will read this data, do some calculations, and write a new output file in a similar format. We will be working with three books, one from Spyri and two from Kafka. However, write your code so that it can be adapted to work with other texts as well. The texts are:
The assignment is due at 11:59pm on Tuesday, March 4.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a4.ipynb
. You
should not turn in the output files (Part 5) as your
notebook should contain the code to create it.
Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells. Do not use external libraries for reading, parsing, or writing the data files, but you may use the collections module for counting words.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
Use the download_text
function in the provided template notebook to download files. Then, write
a function that, given the filename, opens the local file and reads it,
returning the processed lines as a list. Project
Gutenberg includes metadata at the beginning of the text and license
information at the end of the text. The actual text is contained between
the lines
*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
and
*** END OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
where <TITLE>
is the name of the book. Read (and
discard) lines until you hit the START tag. Then, read in all
the lines up to but not including the END tag into a list of
strings. Remove any leading and trailing whitespace from the lines. You
will need to figure out how to find the START and END
tags and which lines should be included.
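The reading step described above might be sketched as follows. The function name read_book and the use of startswith to match the tags are assumptions, not requirements of the assignment:

```python
# Sketch of Part 1: keep only the stripped lines between the START
# and END tags. The name read_book is an assumption.
def read_book(filename):
    """Return the stripped text lines between the START and END tags."""
    lines = []
    in_text = False
    with open(filename, encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith("*** END OF THE PROJECT GUTENBERG EBOOK"):
                break
            if in_text:
                lines.append(stripped)
            elif stripped.startswith("*** START OF THE PROJECT GUTENBERG EBOOK"):
                in_text = True
    return lines
```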
Hint: strip will be useful to remove whitespace.

Next, write a function to loop through the list of lines to find the
set of all non-ASCII characters in the text. Note that you can check
whether a character is in the ASCII subset by checking whether its
integer value is less than 128 (recall ASCII mapped the first 128
characters to the values 0 through 127). Return this set of non-ASCII
characters. Test your function with the data from the books.
Hint: the ord function will be useful to obtain the integer value of a
character.

Next, write a function that creates a new version of the text where
the diacritics (seen in the return value from Part 2) are translated to
non-diacritic sequences that can be saved in ASCII. Here, we wish to
have the following translations (ä to ae, ö to oe, ü to ue, ß to ss)
and upper-case variants (Ä to Ae, Ö to Oe, and Ü to Ue). Use a
programmatic way to derive the upper-case variants from the lowercase
conversions. Your output should be a new list with the converted lines
of text. You may not use a series of if or elif statements for this
task.
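One if/elif-free approach is a translation table built with str.maketrans, with the capitalized entries derived programmatically from the lowercase ones. Note that ß is skipped when building the upper-case table because "ß".upper() is the two-character "SS", which cannot serve as a maketrans key. The names below are assumptions:

```python
# Sketch of Part 3: translate diacritics without any if/elif chains.
LOWER = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
# Derive the upper-case variants programmatically; skip ß because its
# upper-case form "SS" is two characters, not a valid maketrans key.
UPPER = {k.upper(): v.capitalize() for k, v in LOWER.items() if k != "ß"}
TRANSLATIONS = str.maketrans({**LOWER, **UPPER})

def convert_lines(lines):
    """Return a new list with diacritics replaced by ASCII sequences."""
    return [line.translate(TRANSLATIONS) for line in lines]
```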
Next, write a function that computes how often each of the German
personal pronouns (“ich”, “du”, “er”, “sie”, “es”, “wir”, “ihr”) occurs
and returns those counts as a dictionary. To do this, first remove any
punctuation from the strings and then split each line into individual
words, converting them to all lowercase characters. Then, compute the
frequency of each word by counting each word's occurrences. I recommend
using the Counter
from the collections
module
to do this. Then, count the number of each pronoun used and the total
number of pronouns. To make sure you are counting correctly, for Heidi,
ich is used 456 times, du is used 391 times, and there are 3336 total
pronoun usages. Depending on how you replace the punctuation (removing
entirely versus replacing with spaces) or if you convert the lines
first, your counts may vary.
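The counting step above might be sketched as follows. The function name count_pronouns is an assumption; here punctuation is replaced with spaces before splitting, which, as noted, can shift the counts slightly compared to other choices:

```python
import string
from collections import Counter

PRONOUNS = ("ich", "du", "er", "sie", "es", "wir", "ihr")

def count_pronouns(lines):
    """Return a dict mapping each German personal pronoun to its count."""
    # Replace every punctuation character with a space before splitting.
    strip_punct = str.maketrans({p: " " for p in string.punctuation})
    words = Counter()
    for line in lines:
        words.update(line.translate(strip_punct).lower().split())
    return {p: words[p] for p in PRONOUNS}
```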
Hints: string.punctuation gives a string with common punctuation marks
that should be removed. The split function should help break a line
into words. The update method of Counter will be useful.

Write the converted text (from Part 3) to a file. In addition, put
your name at the beginning of the file as the converter, then a line
marking the start of the text (*** START OF THE EBOOK ***),
the converted text, and then a line marking the end of the text
(*** END OF THE EBOOK ***). Finally, use your function from
Part 4 to compute the number of uses of each pronoun, as well as the
total number of pronouns. The statistics need to be right-aligned so your
output should look like:
From Project Gutenberg (www.gutenberg.org)
Converted by: Iamma Huskie
*** START OF THE EBOOK ***
...
*** END OF THE EBOOK ***
ich usage:        456
du usage:         391
er usage:         567
sie usage:        589
es usage:        1110
wir usage:         75
ihr usage:        148
Total pronouns:  3336
The name of the file should be the original filename with -converted
added. For example, pg7500-converted.txt.
Hints: Use print function calls to write a few data items to stdout
first; then, when you are satisfied the format is correct, write all of
the data items to a file via the file parameter of the print function.
Use a with statement to make sure all data is written to the file (or
make sure to call close). Pass the w flag to open to be able to write
to a file. You can view the output file from a notebook cell using the
!cat command:

!cat pg7500-converted.txt
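The writing step might be sketched as below. To keep the example self-contained it takes the already-converted lines and the pronoun counts as arguments; the function name write_converted, the default converter name, and the exact column widths are assumptions:

```python
def write_converted(filename, converted_lines, counts, name="Iamma Huskie"):
    """Write the converted text plus right-aligned pronoun statistics."""
    out_name = filename.replace(".txt", "-converted.txt")
    total = sum(counts.values())
    with open(out_name, "w") as f:
        print("From Project Gutenberg (www.gutenberg.org)", file=f)
        print(f"Converted by: {name}", file=f)
        print("*** START OF THE EBOOK ***", file=f)
        for line in converted_lines:
            print(line, file=f)
        print("*** END OF THE EBOOK ***", file=f)
        # Pad the labels to a fixed width and right-align the numbers.
        for pronoun, count in counts.items():
            print(f"{pronoun + ' usage:':<16}{count:>5}", file=f)
        print(f"{'Total pronouns:':<16}{total:>5}", file=f)
    return out_name
```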
Now, we wish to test the differences between two texts, specifically to investigate the hypothesis that different authors use pronouns differently. For this part, write a single function that takes two filenames and uses the functions written in Parts 1 and 4 to compute the contingency table showing the number of uses of “ich” and “du” along with the sum of the counts for those two pronouns. Right-align the results. Then, compute the chi-squared test for the ich and du usages, and test it on Die Verwandlung vs. Der Prozess (same author) and Heidi vs. Der Prozess (different authors). Report the p-value. A p-value < 0.05 indicates the books are likely written by different authors. Use the function scipy.stats.chi2_contingency to do this calculation. Note that the function does not need the totals, only the upper-left 2x2 square. Sample results:
>>> compare_texts('pg69327.txt', 'pg22367.txt')
                # Ich   # Du  Total
pg69327.txt       844    125    969
pg22367.txt        61      6     67
Total             905    131   1036
p-value: 0.4535378271265039
>>> compare_texts('pg7500.txt', 'pg69327.txt')
                # Ich   # Du  Total
pg7500.txt        456    391    847
pg69327.txt       844    125    969
Total            1300    516   1816
p-value: 4.755567980204363e-55
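The table construction and chi-squared step might be sketched as below. To keep the example self-contained it takes the pronoun-count dictionaries (as produced in Part 4) rather than filenames, so the file-reading plumbing from Parts 1 and 4 is omitted; the function name compare_counts and the column widths are assumptions:

```python
from scipy.stats import chi2_contingency

def compare_counts(name1, counts1, name2, counts2):
    """Print the contingency table and return the chi-squared p-value."""
    rows = [[counts1["ich"], counts1["du"]],
            [counts2["ich"], counts2["du"]]]
    print(f"{'':<14}{'# Ich':>7}{'# Du':>7}{'Total':>7}")
    for name, (ich, du) in zip((name1, name2), rows):
        print(f"{name:<14}{ich:>7}{du:>7}{ich + du:>7}")
    ich_total = rows[0][0] + rows[1][0]
    du_total = rows[0][1] + rows[1][1]
    print(f"{'Total':<14}{ich_total:>7}{du_total:>7}{ich_total + du_total:>7}")
    # chi2_contingency needs only the 2x2 counts, not the row/column totals.
    chi2, pvalue, dof, expected = chi2_contingency(rows)
    print("p-value:", pvalue)
    return pvalue
```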
*** lines in Part 5 (5 pts)