Goals

The goal of this assignment is to get acquainted with Python using Jupyter Notebooks.

Instructions

You may work on this assignment in a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or from the command line by running jupyter-lab.

In this assignment, we will analyze the Information Wanted dataset, which tracks advertisements placed by people looking for lost friends and relatives. This dataset is provided by Boston College based on work supervised by Dr. Ruth-Ann Harris and is documented here. We will be working with a subset of this data, available here. You will analyze this data to answer some questions about it. I have provided code to organize this data, but feel free to improve this rudimentary organization. I have also provided functions that allow you to check your work.

You may organize the cells as you wish, but you must clearly label each problem’s code and its solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document your answer to each question, and make sure your code actually computes that answer. Because the goal of this assignment is to become acquainted with core Python, do not use any libraries other than csv and collections.

You may start with the provided Jupyter Notebook, a1.ipynb. Download this notebook (right-click to save the link) and upload it to your Jupyter workspace (either locally or in a hosted environment). Make sure to execute the first two cells in the notebook (Shift+Enter). The second cell will download the data and define two variables, field_names and records. The field_names variable is a string with the names of each data attribute, separated by commas. The records variable is a list of comma-delimited strings, each holding the field values for one data item. The field names are:

  1. recid: record identifier
  2. mm: the month
  3. dd: the day
  4. yy: the year, a two-digit number that represents a year between 1831 and 1920
  5. firstname: the first name of the person being sought
  6. surname: the surname of the person being sought
  7. sex: the sex of the person being sought, if specified
  8. age: the age of the person being sought, if specified
  9. seek_surname: the surname of the person seeking information
  10. seek_first: the first name of the person seeking information

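Because each record is a single comma-delimited string, split it on commas before indexing into its fields. For example, to access the fourth entry’s year, you would use records[3].split(',')[3]; records[3][3] on its own returns only the fourth character of that string. Remember indexing is zero-based!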
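
As a rough illustration of field access, the sketch below looks up a column position by name and then reads that field from one record. The two record strings are invented stand-ins, not rows from the actual dataset, and the header string is assumed to list the ten field names in the order given above.

```python
# Hypothetical stand-ins for the notebook's variables; the values are invented.
field_names = "recid,mm,dd,yy,firstname,surname,sex,age,seek_surname,seek_first"
records = [
    "1,10,1,31,Patrick,McDermott,M,,McDermott,Mary",
    "2,3,15,54,Bridget,Kelly,F,22,Kelly,John",
]

# Find the position of the year column by name instead of hard-coding it.
names = field_names.split(",")
yy_index = names.index("yy")

# Split the second record into its fields, then pick out the year.
fields = records[1].split(",")
year = fields[yy_index]  # still a string at this point
```

Looking the index up by name keeps the code readable, although a hard-coded position works just as well for a fixed layout.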

The provided notebook includes examples of how to check your work. For example, for Problem 1, you would call the check1 function with the number of unique first names. After executing this function, you will see a message that indicates whether your answer is correct.

Due Date

The assignment is due at 11:59pm on Monday, February 1.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a1.ipynb.

Details

1. Number of Unique First Names (10 pts)

Write code that computes the number of unique first names of people being sought in the dataset. Note that the empty string is not a valid first name.

Hints:
  • You will need to extract the first name from each record string
  • The split function for strings will be useful
  • The strip function will also be useful to trim whitespace
  • Consider using a set to keep track of all the names
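
The hints above can be sketched on a few made-up records. This is only one possible approach; the sample strings and the FIRSTNAME constant are assumptions for illustration and are not part of the provided notebook.

```python
# Invented sample records in the same comma-delimited layout as the dataset.
sample_records = [
    "1,10,1,31, Patrick ,McDermott,M,,McDermott,Mary",
    "2,3,15,54,Bridget,Kelly,F,22,Kelly,John",
    "3,3,15,54,Patrick,Kelly,M,30,Kelly,John",
    "4,6,2,60,,Walsh,,,Walsh,Ann",
]
FIRSTNAME = 4  # zero-based position of the firstname field

unique_names = set()
for record in sample_records:
    # Split the record into fields and trim whitespace around the name.
    name = record.split(",")[FIRSTNAME].strip()
    if name:  # an empty string is falsy, so blank names are skipped
        unique_names.add(name)

num_unique = len(unique_names)
```

Because a set stores each value only once, the padded " Patrick " and the plain "Patrick" collapse into a single entry after stripping.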

2. Most Frequent First Name (10 pts)

Write code that computes the most frequent first name among those being sought. Again, the empty string does not count!

Hints:
  • collections.Counter() is a good structure to help with counting.
  • Clean up the strings in the same manner as in Problem 1.
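
A small sketch of counting with collections.Counter follows. It uses an invented list of names rather than the dataset, and assumes the names have already been extracted from the record strings.

```python
from collections import Counter

# Invented first names, including stray whitespace and one empty entry.
sample_names = ["Patrick", " Mary", "", "Mary ", "John", "Mary"]

# Strip each name and leave out the ones that are empty after stripping.
cleaned = [name.strip() for name in sample_names]
counts = Counter(name for name in cleaned if name)

# most_common(1) returns a list holding one (value, count) pair.
top_name, top_count = counts.most_common(1)[0]
```

Here top_name is "Mary" with a count of 3, since the padded variants are merged by the cleanup step.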

3. Age of the Oldest Person (10 pts)

Write code that computes the age of the oldest person being sought. Note that you will need to ignore those age values that are not numbers. You may use a try-except block for this.

Hints:
  • You can convert a string to an integer by casting it. For example, int("81") returns an integer value of 81.
  • Try converting an invalid age string to an integer to see which error to catch.
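
One way to skip non-numeric values is sketched below on made-up age strings. In Python, int() raises ValueError when given a string it cannot parse, so that is the exception caught here; the sample values are assumptions and may not resemble the actual data.

```python
# Invented age strings: some numeric, some blank or descriptive.
sample_ages = ["22", "", "abt 40", "81", " 7 "]

valid_ages = []
for text in sample_ages:
    try:
        # int() tolerates surrounding whitespace but not other characters.
        valid_ages.append(int(text))
    except ValueError:
        # Not a number, so ignore this entry and move on.
        continue

oldest = max(valid_ages)
```

Catching only ValueError, instead of using a bare except, keeps unrelated bugs from being silently hidden.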

4. Same Surname (10 pts)

Write code that computes the number of entries where the person seeking information has the same surname as the person being sought. Assume capitalization does not matter; thus “Smith” is the same as “smith”. Also, a match of empty strings for surnames does not count.

Hints:
  • Look for a Python method that will convert all characters to the same case.
  • Remember to check that the surnames are not empty
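
The comparison described above might look like the following sketch, which works on invented (sought, seeker) surname pairs that are assumed to have been extracted from the records already.

```python
# Invented (surname sought, surname of seeker) pairs.
pairs = [
    ("Smith", "smith"),
    ("Kelly", "Walsh"),
    ("", ""),
    (" Byrne", "BYRNE "),
]

matches = 0
for sought, seeker in pairs:
    # Normalize both names: trim whitespace and lowercase.
    a = sought.strip().lower()
    b = seeker.strip().lower()
    # Count the pair only if the surname is non-empty and both sides agree.
    if a and a == b:
        matches += 1
```

The pair of empty strings compares equal but is excluded by the truthiness check on a.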

5. Date & Surname of the Latest Entry (10 pts)

Write code that determines the date and surname of the persons being sought in the most recent entries in the dataset. Be careful: the dataset uses two-digit years but encompasses data from 1831 to 1920.

Hints:
  • All of the persons being sought on the most recent date have the same surname
  • 1920 - 1831 < 100
  • If you compare two lists or tuples, Python will do this entry by entry, but make sure to put the data in the correct order.
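
The last two hints can be illustrated with a short sketch on invented dates. Because the data spans fewer than 100 years, each two-digit year maps to exactly one four-digit year. The full_year helper and its cutoff of 20 are assumptions based on the stated 1831 to 1920 range.

```python
def full_year(yy):
    """Map a two-digit year to a four-digit one (hypothetical helper).

    Assumes values 0 through 20 mean 1900 to 1920 and larger values
    fall in the 1800s.
    """
    return 1900 + yy if yy <= 20 else 1800 + yy

# Invented dates stored as (mm, dd, yy), the field order used in the records.
sample_dates = [(12, 25, 99), (1, 5, 3), (6, 30, 31), (11, 2, 3)]

# Comparing the tuples as they are puts the month first, which picks the wrong date.
naive_latest = max(sample_dates)

# Reorder to (year, month, day) so entry-by-entry comparison is chronological.
keys = [(full_year(yy), mm, dd) for mm, dd, yy in sample_dates]
latest = max(keys)
```

With these sample values, naive_latest is (12, 25, 99), a date in 1899, while latest is (1903, 11, 2).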