The goal of this assignment is to use Trifacta Wrangler and pandas to perform data cleaning tasks.
You will need to create a free account for Trifacta Wrangler and use the hosted environment there for the first part of the assignment. You may choose to work on the second part of the assignment in a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work (although Colab's 3.6 should work). To use cloud resources, create or log in to a free Google account for Colab. If you choose to work locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command line as jupyter-lab.
We will work with the Information Wanted dataset, which tracks advertisements placed by those looking for lost friends and relatives. This dataset is provided by Boston College based on work supervised by Dr. Ruth-Ann Harris and is documented here. Our goal is to clean the dataset to make it easier to glean information from.
The assignment is due at 11:59pm on Wednesday, March 3.
You should submit both (a) the completed Trifacta recipe as an a3.wrangle file and (b) the completed notebook file named a3.ipynb on Blackboard. Note that you may do the parts separately as well; if so, please turn in all recipes/notebooks (e.g. a3-part1.wrangle and a3-part2.wrangle). You may need to zip the files together to turn them in via Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
In both parts of the assignment, we will be transforming and cleaning the dataset from its raw form; the set of changes is described in the steps below.
CS 680 students are responsible for all tasks. CS 490 students need not extract states and countries.
Start by signing up for a free Wrangler account; click the “Start Wrangler” link here. Please note that Wrangler requires the Chrome browser. Documentation is available here. After your recipe(s) are completed, show the Recipe, select the gears icon, and select “Download Recipe as wrangle”. Turn the recipe file(s) in as your solution for Parts 1 and 2.
Begin by downloading the raw dataset from the course website to your computer. After signing in to Wrangler, click the “Import Data” button on the upper-right side of the home screen, and import the dataset you downloaded. You can choose the “Import and Wrangle” button to immediately start working. If you choose “Import” instead or come back to this later, you can go to the flows icon and create a new flow, selecting the dataset you have imported.
Because of the way Trifacta loads CSV data, some entries will be surrounded by double-quotes. We don’t want these. Add a step to delete quotes when they are at the beginning and end of any string. You can use a replace method over all columns to do this.
Hint: you can match the start and end of an entry with {start} and {end} and capture the middle portion of the text via parentheses ({any}*), in a similar way to regular expressions in Python.

Create a new column that converts the yy column to the appropriate date. As with Assignment 2, you will need to do this by using the knowledge that the two-digit date must map to a year from 1830 to 1920. You may use some math or an if-then statement to compute the year, and then the DATEFORMAT and DATE methods to create a new date. Note, however, that you will likely have to change the type of the new year column to integer before the date conversion will work. A sketch of the year logic appears below.
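The year arithmetic is the same whether written in Trifacta or Python; a minimal sketch of the mapping (the cutoff at 20 follows directly from the 1830–1920 range):

```python
def to_year(yy: int) -> int:
    # Years must land in 1830-1920, so two-digit values at or
    # below 20 map to 19xx and everything above maps to 18xx.
    return 1900 + yy if yy <= 20 else 1800 + yy

assert to_year(5) == 1905   # 05 -> 1905
assert to_year(95) == 1895  # 95 -> 1895
```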
Now, look at those dates that are invalid by clicking on the far right of the bar in the new date column to highlight the eleven missing values. Rather than scrolling through these, click the checkbox in the lower right to only show the affected rows. Now replace the missing values by filling down (using the values directly above the problematic dates).
There are a few ages that do not match. We can find these by clicking on the mismatched values (the red bar). Again, we can show only those via the appropriate checkbox. These will look familiar to CS 680 students, as we found them in A2. Let's extract the number shown for the four cases where that is appropriate. Note that this isn't quite right, because >7 would imply a number greater than 7, but we'll allow this for now.
Make sure to turn in the recipe (.wrangle file) when you're done. You can also check your solution by running the job and downloading the resulting CSV.
We can either pick up where we left off by extending the recipe from Part 1, create a new recipe that flows from that recipe (recommended), or create a new recipe and flow that starts with the original dataset (if you do this, the quote cleanup will be required again). This time, we will only be concerned with the locations that the person being sought was known to be near. You will see the loc_1 through loc_5 columns for each record, along with the last_heard_loc and last_heard_date columns. We want to tidy these for better processing.
Delete all columns except the recid, loc_1, loc_2, loc_3, loc_4, loc_5, last_heard_loc, and last_heard_date columns.
First, combine last_heard_loc and last_heard_date into a new loc_last column using the concatenate function, setting the separator to a comma to match the loc columns. Replace all entries that consist of just a comma with an empty string. Then, find and use a function to create a new column that coalesces loc_1 and loc_last: we want to fill in missing values in loc_1 with values from loc_last. Rename columns so that this coalesced column is named loc_1. Delete all columns except recid and loc_1 through loc_5.
Unpivoting columns (or pivoting longer) is a way of moving columns to rows. Unpivot the loc_1 through loc_5 columns, and remove all rows that have an empty string in the value column.
We observe that most places in the location string are specified in a country (optional), state/province, city order with commas separating the various pieces. Write a regular expression that will extract such patterns; we will use this same regular expression in pandas, so it's better to write it once than to translate from Trifacta's pattern language. Note that a country, state/province, or city name will have alpha characters and perhaps spaces or periods. Not all places will have all that information, so that pattern may show up one, two, or three times (with commas in between). Create a new column named place with this extracted string (no need to separate countries or states here!).
There are multiple formats that are used for dates in the location fields, including MM/dd/yy, MM/yy, 'yy, and (yyyy), and likely more. We wish to extract all of these and put them in a uniform format. Write a regular expression to find strings that match these formats. Note that you may find strings that are extracted that are not quite right (e.g. parts of addresses); this is ok for this assignment. A sketch of both patterns follows.
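As one illustration (these exact patterns are an assumption, not the required answer), the place and date expressions might look like this in Python syntax:

```python
# Up to three comma-separated groups of letters, spaces, or periods.
PLACE_RE = r"([A-Za-z. ]+(?:,[A-Za-z. ]+){0,2})"

# MM/dd/yy, MM/yy, 'yy, or (yyyy); other variants would need more branches.
DATE_RE = r"(\d{1,2}/\d{1,2}/\d{2}|\d{1,2}/\d{2}|'\d{2}|\(\d{4}\))"
```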
Hint: you can search for alternatives by separating them with the pipe character (|). For example, to find 'US' and 'USA', use 'US|USA'.

Trifacta has the ability to search for state abbreviations in data. Convert the place column to all uppercase characters. Convert NYC to “NY, New York City”. Now, extract any state abbreviations into a new column named state. For now, we will ignore countries.
Output the new locations data as info-wanted-locations.csv. Run the job and generate the final file. Make sure to save the recipe and turn it in with the assignment (see Part 1).
Download and load the raw dataset using pandas. This should be straightforward in pandas. There is no need to deal with quotes as this is taken care of automatically by pandas.
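For instance, a minimal load might look like this (the filename is an assumption; use the name of the file you downloaded):

```python
import pandas as pd

# pandas handles the surrounding quotes automatically on read.
df = pd.read_csv("info-wanted.csv")
```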
This picks up from Part 4 (and Part 5b) in A2. Use the same steps as there to create a new date column (coercing errors). Now we will fill in the 12 problematic dates in this column. Find the pandas function to fill those dates down from the previous entries.
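A minimal sketch of this step, assuming the raw values live in a column named date_raw (an assumed name; use the column from your A2 steps):

```python
import pandas as pd

# Coerce unparseable dates to NaT rather than raising an error.
df["date"] = pd.to_datetime(df["date_raw"], errors="coerce")

# Fill each missing date down from the previous row.
df["date"] = df["date"].ffill()
```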
Now, we will deal with the problematic ages discussed in Part 5a from A2. Here, we will extract any numeric part from the string (>7 -> 7 and 2? -> 2) and convert it. First, build a boolean expression that selects those entries that are both not null and do not satisfy the str.isnumeric method. Then, use the str.extract method to extract a regular expression matching digits. Update the original age column with the five modified entries.
Hint: use expand=False with extract.

Save the updated dataset to a CSV file.
Load the dataset you created in Part 3 (or use the raw dataset).
Combine the two columns last_heard_loc and last_heard_date into a new loc_last column, putting a comma in between them (to make this similar to the loc_# columns). Most of the time, the loc_last data matches one of the values in loc_#. For those cases where loc_1 is empty, we'll fill it with loc_last; fillna will allow replacement with another column. Hint: use the str.cat method with a separator.

Use melt to melt the loc_1 through loc_5 columns, keeping recid as an id column. Drop all missing rows; dropna is useful for dropping the rows without any value. Extract the numeric value only from the variable column and name it order (e.g. loc_1 becomes 1). A sketch of these two steps follows.
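The following is a minimal sketch of both steps, assuming the column names above:

```python
# Join the two last-heard columns with a comma (missing pieces stay missing).
df["loc_last"] = df["last_heard_loc"].str.cat(df["last_heard_date"], sep=",")

# Coalesce: fill empty loc_1 values from loc_last.
df["loc_1"] = df["loc_1"].fillna(df["loc_last"])

# Move loc_1..loc_5 into rows, drop empties, and record the column order.
long = df.melt(id_vars="recid", value_vars=[f"loc_{i}" for i in range(1, 6)])
long = long.dropna(subset=["value"])
long["order"] = long["variable"].str.extract(r"(\d+)", expand=False).astype(int)
```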
Apply the same regular expression as with Trifacta (Part 2) to extract place name strings. Title this column place.
Apply the same regular expression as with Trifacta (Part 2) to extract date-like strings. Title this column date.
We wish to pull out countries and states or provinces from the place name strings. At the end, we will have two new columns, state_prov and country. To do this, first convert everything to upper case and strip whitespace, then split the string based on commas, and then investigate the individual columns. For those columns, we will extract and map some country names, extract the fifty state abbreviations, set those entries that reference 'NYC' to the state 'NY', and extract and map Canadian provinces to abbreviations. The following lists and remaps will be useful:
```python
countries = ['USA', 'CANADA', 'ENGLAND', 'AUSTRALIA', 'MEXICO', 'SCOTLAND', 'IRELAND']
countries_remap = {'AUSTR': 'AUSTRALIA', 'ENG': 'ENGLAND', 'CAN': 'CANADA', 'US': 'USA', 'IRE': 'IRELAND'}
provinces = ['QC', 'ON', 'NB', 'NF', 'PE', 'NS']
provinces_remap = {'QUEBEC': 'QC',
                   'ONT': 'ON',
                   'QUE': 'QC',
                   'NBR': 'NB',
                   'NFLD': 'NF',
                   'PEI': 'PE',
                   'NOVA SCOTIA': 'NS',
                   'NEW BRUNSWICK': 'NB'
                   }
```
Remember to get and use the state abbreviations from the link above, too. We will set the country to CANADA for any entries where we extracted a province and to USA for any entries where we extracted a state (including the NYC references). This is fairly involved, but pays off with states and provinces extracted for all but ~3000 entries.
Hint: to search for any of the state abbreviations at once, combine them into a single pattern separated by the pipe character (|). You can create this programmatically using the join() method.

Hint: use the map method. You may wish to add entries for the original items to the remap dictionary; otherwise they may be lost. A dictionary comprehension will work here.

Hint: use fillna to only fill in missing entries with the newly computed values.

Write the output to a file info-wanted-locations.csv.
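Putting the hints together, here is a sketch of one possible approach; the column names, the split position inspected, and the abbreviated states list are all assumptions, and the real data will need more care:

```python
# The fifty state abbreviations from the linked reference (abbreviated here).
states = ['NY', 'MA', 'PA', 'IL']  # ...plus the remaining abbreviations

# Extend the remaps so already-canonical values survive the mapping.
country_map = {**{c: c for c in countries}, **countries_remap}
province_map = {**{p: p for p in provinces}, **provinces_remap}

# Upper-case, strip, and split on commas; inspect only the first piece here.
parts = long["place"].str.upper().str.strip().str.split(",", expand=True)
first = parts[0].str.strip().replace({"NYC": "NY"})

# A single alternation pattern over all state abbreviations, via join().
state_re = "(" + "|".join(states) + ")"

long["country"] = first.map(country_map)
long["state_prov"] = first.map(province_map)
long["state_prov"] = long["state_prov"].fillna(
    first.str.extract(state_re, expand=False))

# Provinces imply CANADA; states (including former NYC entries) imply USA.
long.loc[long["state_prov"].isin(provinces), "country"] = "CANADA"
long.loc[long["state_prov"].isin(states), "country"] = "USA"

long.to_csv("info-wanted-locations.csv", index=False)
```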
CS 490 students may complete the place extraction parts. All students may receive extra credit for extracting and properly formatting date information from the second column, in either Trifacta or pandas or both.