Goals

The goal of this assignment is to learn how to use pandas for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will analyze the Information Wanted dataset, which tracks advertisements placed by those looking for lost friends and relatives. This dataset is provided by Boston College based on work supervised by Dr. Ruth-Ann Harris and is documented here. In contrast to Assignment 1, we will now be working with the raw data, available here. You will analyze this data to answer some questions about it. I have provided code to organize the data, but feel free to improve this rudimentary organization.

Note that you may organize the cells as you wish, but you must properly label each problem’s code and its solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer to each question is, and make sure your code actually computes that answer. We will do similar types of analysis as in Assignment 1, but now using pandas. You must properly use pandas to receive full credit; do not use loops in this assignment.

Due Date

The assignment is due at 11:59pm on Friday, February 12.

Submission

You should submit the completed notebook file for this assignment on Blackboard. The filename of the notebook should be a2.ipynb. Please use Markdown headings to clearly separate your work for each of the parts. Also, before handing the notebook in, make sure that it runs correctly by using the “Run All” command to run all of the cells in top-down order.

Details

In this assignment, we will use the same info wanted data, but in the raw format. You can download the info_wanted_raw.csv data here. In addition, you may use the following code in your notebook to download the file so it is available for further work:

import os
from urllib.request import urlretrieve

# download the data if we don't have it locally
url = "http://faculty.cs.niu.edu/~dakoop/cs680-2021sp/a2/info_wanted_raw.csv"
local_fname = "info_wanted_raw.csv"
if not os.path.exists(local_fname):
    urlretrieve(url, local_fname)

There are fifty columns in the dataset, but we will not use all of them for this assignment. Those we may use include:

  • recid: record identifier
  • mm: the month
  • dd: the day
  • yy: the year, a two-digit number that represents the year between 1830 and 1920
  • firstname: the first name of the person being sought
  • surname: the surname of the person being sought
  • sex: the sex of the person being sought, if specified
  • age: the age of the person being sought, if specified
  • seek_surname: the surname of the person seeking information
  • seek_first: the first name of the person seeking information

1. Names (15 pts)

We again wish to compute the number of unique first names and the most frequent name. First, we must load the data. Pandas has a read_csv function that will load a dataset into a DataFrame object. You may get a warning about four of the columns when loading the file. This is because the file has been sorted so that certain fields are filled for the beginning entries but not for the rest. You can fix this by setting the dtype parameter so that these fields are treated as strings (str), but you may also ignore the warning because we will not be using those columns in this assignment.
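
As a sketch of the dtype fix, consider a toy CSV with a mixed-type column standing in for the real file (the column names here are placeholders, not the actual columns named in the warning):

```python
import io
import pandas as pd

# A toy CSV standing in for info_wanted_raw.csv; the mixed-type "code"
# column mimics the columns that trigger the warning
csv_text = "recid,code\n1,7\n2,abc\n"

# dtype forces the named column to be read as strings throughout
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})
print(df["code"].tolist())  # → ['7', 'abc']
```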

a. Number of Unique First Names (10 pts)

Using your loaded data frame, compute the number of unique first names. Because we are using the full, raw dataset, the result will be different from Assignment 1. Remember (a) to strip whitespace and (b) that a missing value is not a name, and missing values are set to NaN in pandas.

b. Most Frequent First Name (5 pts)

Using the same data frame, compute the most frequent first name.

Hints
  • Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
  • The unique method is useful for calculating unique values.
  • The isnull and notnull methods will help find those values that are (not) null.
  • The value_counts method is useful for counting the occurrences of values.
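
To illustrate how these hints fit together, here is a minimal sketch on a made-up Series of names (not the real data):

```python
import pandas as pd

names = pd.Series([" Patrick", "Mary ", "Patrick", None, "mary"])

# str methods apply to every element at once; missing values stay NaN
stripped = names.str.strip()

# dropna() excludes the missing values before counting unique names
n_unique = stripped.dropna().nunique()  # → 3 ("Patrick", "Mary", "mary")

# value_counts sorts by frequency, so the first index entry is the most common
most_common = stripped.value_counts().index[0]  # → "Patrick"
```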

2. Maximum Age (10 pts)

Next, we want to compute the maximum known age of a person being sought in the dataset. The problem is that this attribute contains values that are not numbers so we cannot just directly compare. The pandas method to_numeric will convert values to numeric values, but you will need to read the documentation to find a parameter setting that will address this problem. Then, find the maximum value.
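
For a sense of what to_numeric does by default, here is a sketch on toy values (the problem values in the real column will differ):

```python
import pandas as pd

# On purely numeric strings, to_numeric converts the whole Series at once
max_age = pd.to_numeric(pd.Series(["35", "4", "60"])).max()  # → 60

# With a non-numeric value, the default behavior is to raise a ValueError;
# the parameter setting mentioned above changes what happens instead
try:
    pd.to_numeric(pd.Series(["35", "unknown"]))
    raised = False
except ValueError:
    raised = True  # → True under the default settings
```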

Hints

  • The max method computes the maximum value of a numeric Series.

3. Matching Surnames (10 pts)

Now, we want to find those entries where the person being sought has the same surname (last name) as the person seeking. This is expected to be a fairly common situation. Use boolean indexing to select these entries, but relax the condition so that different casings match (e.g. “Smith”, “smith”, and “SMITH” should all match). Also, remember to strip any whitespace before comparison. Find the total number of entries with matching surnames.

Hints

  • Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
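
A minimal sketch of case-insensitive matching with boolean indexing, on a made-up frame rather than the real columns:

```python
import pandas as pd

df = pd.DataFrame({
    "surname": ["Smith ", "Kelly", "SMITH"],
    "seek_surname": ["smith", "Byrne", "Smith"],
})

# Normalize both columns, then compare element-wise to build a boolean mask
mask = (df["surname"].str.strip().str.lower()
        == df["seek_surname"].str.strip().str.lower())

n_matching = mask.sum()  # → 2 (rows 0 and 2 match)
```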

4. Most Recent Entries (15 pts)

Now, we need to find the surname for the most recent entries. To find the dates of the most recent entries, we could try to put together some calculation using the given yy, mm, and dd fields, but it will be more useful to create actual dates. Pandas has a to_datetime method that helps. Reading the documentation, you will see that it expects the fields to be named ‘year’, ‘month’, and ‘day’; rename the columns accordingly. Note, however, that the year is only two digits instead of four. Because we know that the dates of the collected data range from 1831 to 1920, we can write code to add the appropriate two-digit prefix to those years. Now, we can run to_datetime, but recognize that some dates will be invalid (e.g. a month value of -6). Look at the documentation, and find a parameter setting to address this problem. Add a column (‘timestamp’) with this datetime value. Then, find the most recent entry date, and obtain the most common surname among the entries matching that date.

Hints

  • Pandas has a rename method to rename columns; note that it returns a new data frame.
  • You can create new columns in an existing data frame using assignment (df['new_column'] = ...)
  • The parameter setting for addressing invalid datetimes will parallel the one for to_numeric.
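
The rename/to_datetime pipeline can be sketched on toy values; note that the year rule below simply adds 1800 and is only a stand-in — the correct rule must account for where the two-digit year falls in the 1831–1920 range:

```python
import pandas as pd

df = pd.DataFrame({"yy": [48, 51], "mm": [3, 7], "dd": [17, 4]})

# rename returns a new frame with the column names to_datetime expects
dates = df.rename(columns={"yy": "year", "mm": "month", "dd": "day"})

# Toy two-digit-year fix: always prefix with 18 (the real rule must
# decide between 18 and 19 based on the 1831-1920 range)
dates["year"] = dates["year"] + 1800

# to_datetime assembles year/month/day columns into datetime values
df["timestamp"] = pd.to_datetime(dates)
most_recent = df["timestamp"].max()  # → Timestamp("1851-07-04")
```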

5. [CS680 Students] Incorrect Values (20 pts)

It is often instructive to figure out where data is dirty, and what may have gone wrong.

a. Non-numeric Ages (10 pts)

In Part 2, we found the maximum age by dealing with non-numeric ages, but we now want to know which values could not be converted. The problem is that many entries also had no value set (indicated by the NaN value), and these need to be excluded before checking for numeric values. The isnumeric string method will help us find the valid entries, but we need negation to find those that are invalid. Report the three unique values that are not strictly numeric.
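
A sketch of the drop-then-negate pattern on made-up age strings (not the actual problem values):

```python
import pandas as pd

ages = pd.Series(["35", "abt 40", None, "sixty", "12"])

# Exclude the missing values first; NaN is not a string, so the
# isnumeric check would not apply cleanly to it
non_null = ages.dropna()

# ~ negates the boolean mask, keeping entries that are NOT numeric
bad = non_null[~non_null.str.isnumeric()]  # → ["abt 40", "sixty"]
```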

b. Fixing Incorrect Dates (10 pts)

Here, we want to find the dates that are invalid and, for two of those values, investigate what the dates might actually be. In contrast to age, every entry must have yy, mm, and dd set. Thus, when we created the timestamp, only those entries where the conversion didn’t work will be different. Output those 12 entries. Among these entries are those for the McKinlays and the Buckleys. For those two sets of entries, determine a likely value for the date (two dates: one for the McKinlay entries, one for the Buckley entries). The entries in the csv file are largely in order, so it is useful to examine the entries surrounding them.
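
Failed conversions show up as NaT in the timestamp column, which isnull can select; a sketch on a toy frame (the surnames and dates here are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "surname": ["McKinlay", "Buckley", "Kelly"],
    "timestamp": pd.to_datetime(["1851-07-04", None, "1849-03-17"]),
})

# Rows where the conversion failed hold NaT, which isnull() detects
bad_rows = df[df["timestamp"].isnull()]  # → the "Buckley" row
```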

Extra Credit

CS 490 students may complete Part 5 for Extra Credit.