The goal of this assignment is to get acquainted with Python using Jupyter Notebooks.
You may work on this assignment in a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or from the command line as jupyter-lab.
In this assignment, we will analyze the Information Wanted dataset, which tracks advertisements placed by those looking for lost friends and relatives. This dataset is provided by Boston College based on work supervised by Dr. Ruth-Ann Harris and is documented here. We will be working on a subset of this data, available here. You will do some analysis of this data to answer some questions about it. I have provided code to organize this data, but you may feel free to improve this rudimentary organization. I have also provided functions that allow you to check your work. Note that you may choose to organize the cells as you wish, but you must properly label each problem’s code and its solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that. As the goal of this assignment is to become acquainted with core Python, do not use any libraries other than csv and collections.
You may start with the provided Jupyter Notebook, a1.ipynb. Download this notebook (right-click to save the link) and upload it to your Jupyter workspace (either locally or on a hosted environment). Make sure to execute the first two cells in the notebook (Shift+Enter). The second cell will download the data and define two variables, field_names and records. The field_names variable is a string with the names of each data attribute, separated by commas. The records variable is a list of comma-delimited strings with the values of each field for each data item. The field names are:
- recid: record identifier
- mm: the month
- dd: the day
- yy: the year, a two-digit number that represents the year between 1830 and 1920
- firstname: the first name of the person being sought
- surname: the surname of the person being sought
- sex: the sex of the person being sought, if specified
- age: the age of the person being sought, if specified
- seek_surname: the surname of the person seeking information
- seek_first: the first name of the person seeking information
To access the fourth entry’s year, you would access records[3][3]. Remember indexing is zero-based!
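
As a rough illustration of this layout, the sketch below splits records into their fields. The sample_field_names and sample_records values are made up for illustration and are not taken from the dataset. The sketch assumes each record is still a raw comma-delimited string, in which case it has to be split before a field can be reached by index.

```python
# Hypothetical stand-ins for the notebook's field_names and records variables.
sample_field_names = "recid,mm,dd,yy,firstname,surname,sex,age,seek_surname,seek_first"
sample_records = [
    "1,10,1,31,Patrick,McDermott,M,,Kelly,Mary",
    "2,3,15,54,Mary ,Smith,F,30,smith,John",
]

# Split the header once, then split each record and trim stray whitespace.
names = sample_field_names.split(",")
rows = [[value.strip() for value in record.split(",")] for record in sample_records]

# Look up a column position by name instead of hard-coding it.
year_index = names.index("yy")
second_year = rows[1][year_index]

# Pairing names with values gives a more readable view of one record.
first_as_dict = dict(zip(names, rows[0]))
```

Once the records are split into lists like rows, an expression of the form rows[i][j] picks field j of entry i.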
The provided file includes examples of how to check your work. For example, for Problem 1, you would call the check1 function with the number of unique first names. After executing this function, you will see a message that indicates whether your answer is correct.
The assignment is due at 11:59pm on Monday, February 1.
You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a1.ipynb.
Write code that computes the number of unique first names of people being sought in the dataset. Note that the empty string is not a valid first name.
Hints:
- The split function for strings will be useful
- The strip function will also be useful to trim whitespace
- Use a set to keep track of all the names
Write code that computes the most frequent first name among those being sought. Again, the empty string does not count!
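
For counting repeated values, one option within the allowed collections library is Counter. The sketch below runs on a made-up list (made_up_names is hypothetical and not drawn from the dataset) and only shows the counting pattern, not a full solution.

```python
from collections import Counter

# Hypothetical values; the empty strings stand for missing first names.
made_up_names = ["Mary", "John", "", "Mary", "Ann", ""]

# Skip empty strings while counting so they can never be the top entry.
counts = Counter(name for name in made_up_names if name != "")

# most_common(1) returns a one-element list of (value, count) pairs.
top_name, top_count = counts.most_common(1)[0]
```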
Hint: collections.Counter() is a good structure to help with counting.
Write code that computes the age of the oldest person being sought. Note that you will need to ignore those age values that are not numbers. You may use a try-except block for this.
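
A try-except block around a conversion might look like the sketch below. The parse_age helper and the made_up_ages values are hypothetical, chosen only to show how non-numeric entries can be skipped.

```python
def parse_age(text):
    """Return text as an int, or None when it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for empty or non-numeric strings.
        return None

# Hypothetical age values, including blanks and free-text entries.
made_up_ages = ["30", "", "abt 40", " 81 ", "7"]

# Keep only the values that converted cleanly.
valid_ages = [age for age in map(parse_age, made_up_ages) if age is not None]
```

Note that int() tolerates surrounding whitespace, so " 81 " converts without an explicit strip.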
Hint: int("81") returns an integer value of 81.
Write code that computes the number of entries where the person seeking information has the same surname as the person being sought. Assume capitalization does not matter; thus “Smith” is the same as “smith”. Also, a match of empty strings for surnames does not count.
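
One way to compare two strings while ignoring capitalization is to lowercase both sides first. The same_surname helper below is a hypothetical sketch; trimming whitespace before comparing is an added assumption, not something the assignment requires.

```python
def same_surname(sought, seeker):
    """Case-insensitive match that rejects empty surnames."""
    # Normalize both names; stripping whitespace is an assumption here.
    sought = sought.strip().lower()
    seeker = seeker.strip().lower()
    # An empty string on either side never counts as a match.
    return sought != "" and sought == seeker

# Hypothetical pairs of (surname sought, surname of seeker).
made_up_pairs = [("Smith", "smith"), ("", ""), ("Kelly", "Kelley")]
matches = [same_surname(a, b) for a, b in made_up_pairs]
```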
Write code that determines the date and surname of the persons being sought in the most recent entries in the dataset. Be careful: the dataset uses two-digit years but encompasses data from 1831 to 1920.
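
Because the two-digit years span a century boundary, comparing them directly would rank 99 (1899) above 20 (1920). The sketch below shows one possible way to expand them before comparing. The full_year helper, the pivot value of 30, and the made_up_dates are all assumptions for illustration; any pivot between 21 and 30 is consistent with the stated range.

```python
def full_year(yy):
    """Expand a two-digit year, assuming the data ends in 1920."""
    yy = int(yy)
    # Assumed pivot: 30-99 belong to the 1800s, 00-29 to the 1900s.
    return 1800 + yy if yy >= 30 else 1900 + yy

# Hypothetical (mm, dd, yy) strings, not taken from the dataset.
made_up_dates = [("12", "31", "99"), ("1", "5", "20"), ("6", "1", "31")]

# Tuples of (year, month, day) compare chronologically.
date_keys = [(full_year(yy), int(mm), int(dd)) for mm, dd, yy in made_up_dates]
latest = max(date_keys)
```

Converting month and day to integers matters as well, since the string "9" would otherwise sort after "10".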