The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.
You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.10 for your work, but versions 3.8 and 3.9 should also work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command line as jupyter-lab.
In this assignment, we will be working with texts from Project Gutenberg, a library of eBooks that includes a lot of literature for which U.S. copyright has expired. You will process these files, count words, and convert some of the text. We will use German texts to demonstrate the benefits of Unicode encoding.
A template notebook is provided with a cell to help download the files given a URL. You will read this data, do some calculations, and write a new output file in a similar format. We will be working with two books, the second of which is written in two parts. However, write your code so that it can be adapted to work with other texts as well. The texts are:

- Siddhartha by Hermann Hesse (pg2499.txt)
- Heidi, Part 1 by Johanna Spyri (pg7500.txt)
- Heidi, Part 2 by Johanna Spyri (pg7512.txt)
The assignment is due at 11:59pm on Tuesday, October 11. You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a4.ipynb. You should not turn in the output files (Part 5), as your notebook should contain the code to create them.
Please make sure to follow the instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells. Do not use external libraries for reading, parsing, or writing the data files, but you may use the collections module for counting words.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
Use the download_text function in the provided template notebook to download the files. Then, write a function that, given the filename, opens and reads the downloaded file, returning the processed lines as a list. Project Gutenberg includes metadata at the beginning of the text and license information at the end of the text. The actual text is contained between the lines

*** START OF <THE|THIS> PROJECT GUTENBERG EBOOK <TITLE> ***

and

*** END OF <THE|THIS> PROJECT GUTENBERG EBOOK <TITLE> ***

where <THE|THIS> indicates that either the word THE or the word THIS is present, and <TITLE> is the name of the book. Read (and discard) lines until you hit the START tag. Then, read all the lines up to but not including the END tag into a list of strings. Remove any leading and trailing whitespace from the lines. You will need to figure out how to find the START and END tags and which lines should be included.
Hint: strip will be useful to remove whitespace.
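A minimal sketch of one possible approach follows; the function name read_book and the exact tag-matching logic are illustrative assumptions, not requirements:

def read_book(filename):
    """Return the stripped lines between the START and END tags."""
    lines = []
    # Project Gutenberg files are typically UTF-8 encoded.
    with open(filename, encoding='utf-8') as f:
        # Skip (and discard) the metadata until the START tag.
        for line in f:
            if line.startswith('*** START OF') and 'PROJECT GUTENBERG EBOOK' in line:
                break
        # Collect the text lines up to, but not including, the END tag.
        for line in f:
            if line.startswith('*** END OF') and 'PROJECT GUTENBERG EBOOK' in line:
                break
            lines.append(line.strip())
    return lines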
Next, write a function to loop through the list of lines to find the set of all non-ASCII characters in the text. Note that you can check whether a character is in the ASCII subset by checking whether its integer value is less than 128 (recall ASCII mapped the first 128 characters to the values 0 through 127). Return this set of non-ASCII characters. Test your function with the data from one of the books.
Hint: The ord function will be useful to obtain the integer value of a character.
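For example, a set comprehension can collect these characters in one pass (a sketch; the function name is illustrative):

def find_non_ascii(lines):
    # Keep every character whose code point is 128 or greater (non-ASCII).
    return {ch for line in lines for ch in line if ord(ch) >= 128}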
Next, write a function that creates a new version of the text where the diacritics (seen in the return value from Part 2) are translated to non-diacritic sequences that can be saved as ASCII. Here, we wish to have the following translations (ä to ae, ö to oe, ü to ue, ß to ss) and upper-case variants (Ä to Ae, Ö to Oe, and Ü to Ue). Use a programmatic way to do the upper-case variants using the lowercase conversions. Your output should be a new list with the converted lines of text. You may not use a series of if or elif statements for this task.
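One possible sketch, assuming a dictionary of lowercase conversions and str.translate (other approaches, such as repeated str.replace calls, also work):

def convert_text(lines):
    # Lowercase conversions given in the assignment.
    conversions = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}
    # Derive the upper-case variants programmatically; ß is skipped because
    # 'ß'.upper() is 'SS', which is not a single character.
    conversions.update({k.upper(): v.capitalize()
                        for k, v in conversions.items() if k != 'ß'})
    table = str.maketrans(conversions)
    return [line.translate(table) for line in lines]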
Next, write a function that computes how many words appear infrequently (<= 4 times) or more frequently (> 4 times). To do this, first remove any punctuation from the strings and then split each line into individual words, converting them to all lowercase characters (updated 2022-10-07). Then, compute the frequency of each word by counting each word's occurrences. I recommend using the Counter from the collections module to do this. Then, count the number of words that appear rarely and more frequently, and return both. To make sure you are counting correctly, for Heidi, Part 1, the number of infrequent words is 4786 and the number of frequent words is 1101 (updated 2022-10-07). Depending on how you replace the punctuation (removing it entirely versus replacing it with spaces) or whether you convert the lines first, you may also see 4699 and 1097, or 4768 and 1102, respectively (updated 2022-10-10).
Hints:
- string.punctuation gives a string with common punctuation marks that should be removed.
- The split function should help break a line into words.
- The update method of Counter will be useful.
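A sketch, assuming punctuation is replaced with spaces (one of the variants mentioned above); the name count_frequencies is illustrative:

import string
from collections import Counter

def count_frequencies(lines, threshold=4):
    counts = Counter()
    # Map each punctuation mark to a space, then lowercase and split into words.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    for line in lines:
        counts.update(line.translate(table).lower().split())
    infrequent = sum(1 for c in counts.values() if c <= threshold)
    # Every other unique word appears more than threshold times.
    frequent = len(counts) - infrequent
    return infrequent, frequent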
Write the converted text (from Part 3) to a file. In addition, put your name at the beginning of the file as the converter, then a line marking the start of the text (*** START OF THE EBOOK ***), the converted text, and then a line marking the end of the text (*** END OF THE EBOOK ***). Finally, use your function from Part 4 to compute the number of frequent and infrequent words, and also compute the total number of unique words. The statistics need to be right-aligned, so your output should look like:
From Project Gutenberg (www.gutenberg.org)
Converted by: Iamma Huskie
*** START OF THE EBOOK ***
...
*** END OF THE EBOOK ***
Number of infrequent words: 4786
Number of frequent words:   1101
Total unique words:         5887
The name of the file should be the original filename with -converted added before the extension. For example, pg7500-converted.txt.
Hints:
- Use print function calls to write a few data items to stdout first; then, when you are satisfied the format is correct, write all of the data items to a file.
- You can write to the file by passing it to the file parameter of the print function.
- Use a with statement to make sure all data is written to the file (or make sure to call close).
- Pass the w flag to open to be able to write to a file.
- You can view the output file from the notebook with the !cat command: !cat pg7500-converted.txt
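A sketch that reuses the hypothetical count_frequencies from Part 4; the column width of 32 is an arbitrary choice for right-aligning the statistics:

def write_converted(filename, lines, name='Iamma Huskie'):
    out_name = filename.replace('.txt', '-converted.txt')
    infrequent, frequent = count_frequencies(lines)
    unique = infrequent + frequent
    with open(out_name, 'w', encoding='utf-8') as f:
        print('From Project Gutenberg (www.gutenberg.org)', file=f)
        print(f'Converted by: {name}', file=f)
        print('*** START OF THE EBOOK ***', file=f)
        for line in lines:
            print(line, file=f)
        print('*** END OF THE EBOOK ***', file=f)
        # Pad each number so its right edge lands in the same column.
        width = 32
        for label, value in [('Number of infrequent words:', infrequent),
                             ('Number of frequent words:', frequent),
                             ('Total unique words:', unique)]:
            print(f'{label}{value:>{width - len(label)}}', file=f)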
Now, we wish to test the differences between two texts, specifically to investigate the hypothesis that one text differs from the other in word reuse. For this part, write a single function that takes two filenames and uses the functions written in Parts 1 and 4 to compute the contingency table showing the number of infrequent and frequent words for each text along with their totals. Right-align the results. Then, compute the chi-squared test for the infrequent and frequent words, and test it on Heidi, Part 1 vs. Heidi, Part 2 (same author) and Heidi, Part 1 vs. Siddhartha (different authors). Report the p-value. A p-value < 0.05 would suggest the books were likely written by different authors; since we get the opposite of the expected results below, we can see this test does not work well for our example. Use the function scipy.stats.chi2_contingency to do this calculation. Note that the function does not need the totals, only the upper-left 2x2 square. Sample results (updated 2022-10-10):
>>> compare_texts('pg7500.txt', 'pg7512.txt')
              # Infrequent  # Frequent   Total
pg7500.txt            4786        1101    5887
pg7512.txt            4278         852    5130
Total                 9064        1953   11017
p-value: 0.004429356727655681
>>> compare_texts('pg2499.txt', 'pg7500.txt')
              # Infrequent  # Frequent   Total
pg2499.txt            4207         887    5094
pg7500.txt            4786        1101    5887
Total                 8993        1988   10981
p-value: 0.08446516178458324
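A sketch combining the earlier hypothetical read_book and count_frequencies helpers; the column widths are illustrative choices to match the sample layout:

from scipy.stats import chi2_contingency

def compare_texts(filename1, filename2):
    table = []
    for filename in (filename1, filename2):
        infrequent, frequent = count_frequencies(read_book(filename))
        table.append([infrequent, frequent])
    # Print the contingency table with totals, right-aligning each column.
    print(f'{"":<12}{"# Infrequent":>14}{"# Frequent":>12}{"Total":>8}')
    for name, (inf, freq) in zip((filename1, filename2), table):
        print(f'{name:<12}{inf:>14}{freq:>12}{inf + freq:>8}')
    col1 = table[0][0] + table[1][0]
    col2 = table[0][1] + table[1][1]
    print(f'{"Total":<12}{col1:>14}{col2:>12}{col1 + col2:>8}')
    # chi2_contingency takes only the upper-left 2x2 counts, not the totals.
    _, p_value, _, _ = chi2_contingency(table)
    print(f'p-value: {p_value}')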
Note: including the *** marker lines in Part 5 is worth 5 pts.