The goal of this assignment is to use Trifacta Wrangler and pandas to perform data cleaning tasks.
You will need to create a free account for Trifacta Wrangler and use the hosted environment there for the first part of the assignment. You may choose to work on the second part of the assignment in a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work (although Colab's 3.6 should work). To use cloud resources, create or log in to a free Google account for Colab. If you choose to work locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command line as jupyter-lab.
We will work with the Information Wanted dataset, which tracks advertisements placed by those looking for lost friends and relatives. This dataset is provided by Boston College based on work supervised by Dr. Ruth-Ann Harris and is documented here. Our goal is to clean the dataset to make it easier to glean information from.
The assignment is due at 11:59pm on Wednesday, March 3.
You should submit both (a) the completed Trifacta recipe as an a3.wrangle file and (b) the completed notebook file named a3.ipynb on Blackboard. Note that you may do the parts separately as well; if so, please turn in all recipes/notebooks (e.g. a3-part1.wrangle and a3-part2.wrangle). You may need to zip the files together to turn them in via Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
In both parts of the assignment, we will be transforming and cleaning the dataset from its raw form; the set of changes is described in the steps below.
CS 680 students are responsible for all tasks. CS 490 students need not extract states and countries.
Start by signing up for a free Wrangler account; click the “Start Wrangler” link here. Please note that Wrangler requires the Chrome browser. Documentation is available here. After your recipe(s) are completed, show the Recipe, select the gears icon, and select “Download Recipe as wrangle”. Turn the recipe file(s) in as your solution for Parts 1 and 2.
Begin by downloading the raw dataset from the course website to your computer. After signing in to Wrangler, click the “Import Data” button on the upper-right side of the home screen, and import the dataset you downloaded. You can choose the “Import and Wrangle” button to immediately start working. If you choose “Import” instead or come back to this later, you can go to the flows icon and create a new flow, selecting the dataset you have imported.
Because of the way Trifacta loads CSV data, some entries will be surrounded by double-quotes. We don’t want these. Add a step to delete quotes when they are at the beginning and end of any string. You can use a replace method over all columns to do this.
Hint: you can match the start and end of an entry with {start} and {end} and capture the middle portion of the text via parentheses ({any}*), in a similar way to regular expressions in Python.

Create a new column that converts the yy column to the appropriate date. As with Assignment 2, you will need to do this by using the knowledge that the two-digit date must map to a year from 1830 to 1920. You may use some math or an if-then statement to compute the year, and then the DATEFORMAT and DATE methods to create a new date. Note, however, that you will likely have to change the type of the new year column to integer before the date conversion will work. A sketch of the year logic appears below.
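The year arithmetic is the same whether written in Trifacta or Python; a minimal sketch of the mapping (the cutoff at 20 follows directly from the 1830–1920 range):

```python
def to_year(yy: int) -> int:
    # Years must land in 1830-1920, so two-digit values at or
    # below 20 map to 19xx and everything above maps to 18xx.
    return 1900 + yy if yy <= 20 else 1800 + yy

assert to_year(5) == 1905   # 05 -> 1905
assert to_year(95) == 1895  # 95 -> 1895
```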
Now, look at those dates that are invalid by clicking on the far right of the bar in the new date column to highlight the eleven missing values. Rather than scrolling through these, click the checkbox in the lower right to only show the affected rows. Now replace the missing values by filling down (using the values directly above the problematic dates).
There are a few ages that do not match. We can find these by clicking on the mismatched values (the red bar). Again, we can show only those via the appropriate checkbox. These will look familiar to CS 680 students, as we found them in A2. Let's extract the number shown for the four cases where that is appropriate. Note that this isn't quite right, because >7 would imply a number greater than 7, but we'll allow this for now.
Make sure to turn in the recipe (.wrangle file) when you're done. You can also check your solution by running the job and downloading the resulting CSV.
We can either pick up where we left off by extending the recipe from Part 1, create a new recipe that flows from that recipe (recommended), or create a new recipe and flow that starts with the original dataset (if you do this, the quote cleanup will be required again). This time, we will only be concerned with the locations that the person being sought was known to be near. You will see the loc_1 through loc_5 columns for each record, along with the last_heard_loc and last_heard_date columns. We want to tidy these for better processing.
Delete all columns except the recid, loc_1, loc_2, loc_3, loc_4, loc_5, last_heard_loc, and last_heard_date columns.
First, combine last_heard_loc and last_heard_date into a new loc_last column using the concatenate function, setting the separator to a comma to match the loc columns. Replace all entries that consist of just a comma with an empty string. Then, find and use a function to create a new column that coalesces loc_1 and loc_last: we want to fill in missing values in loc_1 with values from loc_last. Rename columns so that this coalesced column is named loc_1. Delete all columns except recid and loc_1 through loc_5.
Unpivoting columns (or pivoting longer) is a way of moving columns to rows. Unpivot the loc_1 through loc_5 columns, and remove all rows that have an empty string in the value column.
We observe that most places in the location string are specified in a country (optional), state/province, city order with commas separating the various pieces. Write a regular expression that will extract such patterns; we will use this same regular expression in pandas, so it's better to write it once than to translate from Trifacta's pattern language. Note that a country, state/province, or city name will have alpha characters and perhaps spaces or periods. Not all places will have all that information, so that pattern may show up one, two, or three times (with commas in between). Create a new column named place with this extracted string (no need to separate countries or states here!).
There are multiple formats that are used for dates in the location fields, including MM/dd/yy, MM/yy, 'yy, and (yyyy), and likely more. We wish to extract all of these and put them in a uniform format. Write a regular expression to find strings that match these formats. Note that you may find strings that are extracted that are not quite right (e.g. parts of addresses); this is ok for this assignment. A sketch of both patterns follows.
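As one illustration (these exact patterns are an assumption, not the required answer), the place and date expressions might look like this in Python syntax:

```python
# Up to three comma-separated groups of letters, spaces, or periods.
PLACE_RE = r"([A-Za-z. ]+(?:,[A-Za-z. ]+){0,2})"

# MM/dd/yy, MM/yy, 'yy, or (yyyy); other variants would need more branches.
DATE_RE = r"(\d{1,2}/\d{1,2}/\d{2}|\d{1,2}/\d{2}|'\d{2}|\(\d{4}\))"
```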
Hint: you can search for alternatives by separating them with the pipe character (|). For example, to find 'US' and 'USA', use 'US|USA'.

Trifacta has the ability to search for state abbreviations in data. Convert the place column to all uppercase characters. Convert NYC to “NY, New York City”. Now, extract any state abbreviations into a new column named state. For now, we will ignore countries.
Output the new locations data as info-wanted-locations.csv. Run the job and generate the final file. Make sure to save the recipe and turn it in with the assignment (see Part 1).
Download and load the raw dataset using pandas. This should be straightforward in pandas. There is no need to deal with quotes as this is taken care of automatically by pandas.
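For instance, a minimal load might look like this (the filename is an assumption; use the name of the file you downloaded):

```python
import pandas as pd

# pandas handles the surrounding quotes automatically on read.
df = pd.read_csv("info-wanted.csv")
```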
This picks up from Part 4 (and Part 5b) in A2. Use the same steps as there to create a new date column (coercing errors). Now we will fill in the 12 problematic dates in this column. Find the pandas function to fill those dates down from the previous entries.
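A minimal sketch of this step, assuming the raw values live in a column named date_raw (an assumed name; use the column from your A2 steps):

```python
import pandas as pd

# Coerce unparseable dates to NaT rather than raising an error.
df["date"] = pd.to_datetime(df["date_raw"], errors="coerce")

# Fill each missing date down from the previous row.
df["date"] = df["date"].ffill()
```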
Now, we will deal with the problematic ages discussed in Part 5a from A2. Here, we will extract any numeric part from the string (>7 -> 7 and 2? -> 2) and convert it. First, build a boolean expression that selects those entries that are both not null and do not satisfy the str.isnumeric method. Then, use the str.extract method to extract a regular expression matching digits. Update the original age column with the five modified entries.
Hint: use expand=False with extract.

Save the updated dataset to a CSV file.
Load the dataset you created in Part 3 (or use the raw dataset).
Combine the two columns last_heard_loc and last_heard_date into a new loc_last column, putting a comma in between them (to make this similar to the loc_# columns). Most of the time, the loc_last data matches one of the values in loc_#. For those cases where loc_1 is empty, we'll fill it with loc_last; fillna will allow replacement with another column. Hint: use the str.cat method with a separator.

Use melt to melt the loc_1 through loc_5 columns, keeping recid as an id column. Drop all missing rows; dropna is useful for dropping the rows without any value. Extract the numeric value only from the variable column and name it order (e.g. loc_1 becomes 1). A sketch of these two steps follows.
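The following is a minimal sketch of both steps, assuming the column names above:

```python
# Join the two last-heard columns with a comma (missing pieces stay missing).
df["loc_last"] = df["last_heard_loc"].str.cat(df["last_heard_date"], sep=",")

# Coalesce: fill empty loc_1 values from loc_last.
df["loc_1"] = df["loc_1"].fillna(df["loc_last"])

# Move loc_1..loc_5 into rows, drop empties, and record the column order.
long = df.melt(id_vars="recid", value_vars=[f"loc_{i}" for i in range(1, 6)])
long = long.dropna(subset=["value"])
long["order"] = long["variable"].str.extract(r"(\d+)", expand=False).astype(int)
```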
Apply the same regular expression as with Trifacta (Part 2) to extract place name strings. Title this column place.
Apply the same regular expression as with Trifacta (Part 2) to extract date-like strings. Title this column date.
We wish to pull out countries and states or provinces from the place name strings. At the end, we will have two new columns, state_prov and country. To do this, first convert everything to upper case and strip whitespace, then split the string based on commas, and then investigate the individual columns. For those columns, we will extract and map some country names, extract the fifty state abbreviations, set those entries that reference 'NYC' to the state 'NY', and extract and map Canadian provinces to abbreviations. The following lists and remaps will be useful:
```python
countries = ['USA', 'CANADA', 'ENGLAND', 'AUSTRALIA', 'MEXICO', 'SCOTLAND', 'IRELAND']
countries_remap = {'AUSTR': 'AUSTRALIA', 'ENG': 'ENGLAND', 'CAN': 'CANADA', 'US': 'USA', 'IRE': 'IRELAND'}
provinces = ['QC', 'ON', 'NB', 'NF', 'PE', 'NS']
provinces_remap = {'QUEBEC': 'QC',
                   'ONT': 'ON',
                   'QUE': 'QC',
                   'NBR': 'NB',
                   'NFLD': 'NF',
                   'PEI': 'PE',
                   'NOVA SCOTIA': 'NS',
                   'NEW BRUNSWICK': 'NB'
                   }
```
Remember to get and use the state abbreviations from the link above, too. We will set the country to CANADA for any entries where we extracted a province and to USA for any entries where we extracted a state (including the NYC references). This is fairly involved, but pays off with states and provinces extracted for all but ~3000 entries.
Hint: to search for any of the state abbreviations at once, combine them into a single pattern separated by the pipe character (|). You can create this programmatically using the join() method.

Hint: use the map method. You may wish to add entries for the original items to the remap dictionary; otherwise they may be lost. A dictionary comprehension will work here.

Hint: use fillna to only fill in missing entries with the newly computed values.

Write the output to a file info-wanted-locations.csv.
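Putting the hints together, here is a sketch of one possible approach; the column names, the split position inspected, and the abbreviated states list are all assumptions, and the real data will need more care:

```python
# The fifty state abbreviations from the linked reference (abbreviated here).
states = ['NY', 'MA', 'PA', 'IL']  # ...plus the remaining abbreviations

# Extend the remaps so already-canonical values survive the mapping.
country_map = {**{c: c for c in countries}, **countries_remap}
province_map = {**{p: p for p in provinces}, **provinces_remap}

# Upper-case, strip, and split on commas; inspect only the first piece here.
parts = long["place"].str.upper().str.strip().str.split(",", expand=True)
first = parts[0].str.strip().replace({"NYC": "NY"})

# A single alternation pattern over all state abbreviations, via join().
state_re = "(" + "|".join(states) + ")"

long["country"] = first.map(country_map)
long["state_prov"] = first.map(province_map)
long["state_prov"] = long["state_prov"].fillna(
    first.str.extract(state_re, expand=False))

# Provinces imply CANADA; states (including former NYC entries) imply USA.
long.loc[long["state_prov"].isin(provinces), "country"] = "CANADA"
long.loc[long["state_prov"].isin(states), "country"] = "USA"

long.to_csv("info-wanted-locations.csv", index=False)
```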
CS 490 students may complete the place extraction parts. All students may receive extra credit for extracting and properly formatting date information from the second column, in either Trifacta or pandas or both.