Assignment 3

Goals

The goal of this assignment is to use Trifacta Wrangler and pandas to perform some data cleaning tasks.

Instructions

You will need to create a free account for Trifacta Wrangler and use the hosted environment there for part of the assignment. You should work on the other part of the assignment on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will clean the raw the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. Because this dataset is continually updated and we want to focus on particular issues in the dataset, we will used a specific version and subset of the dataset, available here. Note that both Wrangler and pandas can read gzip-compressed csv (.csv.gz) files without you needing to first uncompress them.

Due Date

The assignment is due at 11:59pm on Friday, March 11.

Submission

You should submit both (a) the completed Trifacta recipes as an a3-part*.wrangle files, and (b) the single completed notebook file named a3.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

In both parts of the assignment, we will be transforming and cleaning the dataset from its raw form a more usable form. The set of changes include:

  • Cleaning the medium column
  • Fixing problematic accession years
  • Fixing problematic object end dates
  • Transforming tags from a delimited string to a new table
  • Transforming artist information columns into a new table

CSCI 490 students are responsible for all tasks except the artist transformations. CSCI 680 students are responsible for all tasks.

1. Cleaning in Trifacta Wrangler

Start by signing up for a free Wrangler account; click the “Start Wrangler” link here. Please note that Wrangler requires the Chrome browser. Documentation is available here. For each recipe, when your recipe is completed, show the Recipe, select the gears icon, and select “Download Recipe as wrangle”. Turn in that file as your solution for Part 1.

a. Load Data

Begin by downloading the raw dataset to your computer. After signing in to Wrangler, click the “Import Data” button on the upper-right side of the home screen, and import the dataset you downloaded. You can choose the “Import and Wrangle” button to immediately start working. If you choose “Import” instead or come back to this later, you can go to the flows icon and create a new flow, selecting the dataset you have imported.

b. Convert Medium (10 pts)

The Medium column is written with both capitalized words and lowercase words. Standardize this by converting this to all uppercase. Also strip any leading or trailing whitespace. You can verify the need for the whitespace removal by finding the number of unique values for the column before and after this change. Replace the original Medium column with your cleaned column.

c. Fix Problematic Accession Years (10 pts)

There are only a few problematic accession year, and Trifacta should allow you to see these by clicking the red bar in the column. Based on these values, create a rule to extract only the year, making sure to remove the original column and rename the new column AccessionYear.

d. Fix Object End Dates (15 pts)

There are certain years where the Object End Date is before the Object Begin Date. These fields should correspond with the entries in the Object Date field. In some cases, the end date is 0 which may be a placeholder for either an unknown or missing value or the same year. In other cases, it seems like something went wrong with the translation specifically with dates that begin in 664 B.C. (-664). We will try to fix the second case, and set the others to null values to indicate that there is some issue with the object end date. To accomplish this, craft a way to extract the end date from the Object Date column but only when we see a dash (either - or –) and the “B.C.” ending since we know that is where things are problematic. Now, coalesce the new column with the old column, making sure that the new column takes precedence. Finally, fill those remaining values where the Object End Date is less than Object Begin Date with null. Finally, make sure to remove the original column and rename the cleaned column as Object End Date.

e. Save Recipe

Save your recipe for this part as “a3-part1.wrangle” by clicking the recipe view, and then the gear icon and selecting “Download Recipe as wrangle”. You can also run your recipe on the dataset and download the resulting csv file to compare your results with those from Part 3.

2. Transformation in Trifacta Wrangler

a. Transform Tags (15 pts)

Create a new recipe that extracts the Object ID and Tags columns so that we have a new table from Object ID to each individual tag. To create a new recipe, go back to the flow, and create a new recipe (either off of the original data or the recipe from Part 1). Then select the Object ID and Tags columns and create a new table or delete all the other columns. Now, we want to split each string in the Tags column into an array of values based on the | delimiter. Note that this is a special character is both Wrangler’s language and regular expressions and therefor should be escaped with a backslash. Finally, create a new row for each tag, removing any rows that have no tags. Turn in this recipe as “a3-part2a.wrangle”.

Hints
  • Once you have selected columns, you can click the … on the right panel to see options about what to do with the columns.
  • While it is possible to do this with splits, you will end up with an unwieldy number of columns
  • Search for the transformation “Extract matches to Array” and split on the correct delimiter
  • After extracting to the array, check the Restructure submenu for a command to expand the array into rows.

b. [CSCI 680 Only] Transform Artists (20 pts)

Create a third recipe that extracts artist information and creates a table with the artist information split into rows for each of the objects in the collection dataset. For this part, we are interested in four columns: Artist Role, Artist Display Name, Artist Begin Date, and Artist End Date. Like Part a, we can split the individual columns into arrays. However, in contrast to Part a, we need to keep the correspondences between the extracted elements. In other words, the first element in the Artist Role column should be paired with the first element in the Artist Display Name column. The final table should retain Object ID. Here, the ARRAYZIP function (search for transformation) will be useful. In addition, we can extract elements to columns, but then we need to split the arrays back into individual columns again and rename them appropriately. Turn in this recipe as “a3-part2b.wrangle”.

Hints

  • Use the magnifying glass icon to search for particular functions
  • You can unnest arrays into columns, but will need to use array indexing to extract out each column of elements (e.g. [0]); you can do multiple of these extractions in one step, though.

3. Cleaning in Pandas

We will repeat the operations in Part 1 but now using pandas.

a. Load Data

Use panda’s read_csv method to load the data. Be aware that you may improve type inference if you are able to set low_memory=False. Store the loaded data frame into a variable for use throughout the assignment.

b. Convert Medium (10 pts)

The Medium column is written with both capitalized words and lowercase words. Standardize this by converting this to all uppercase. Also strip any leading or trailing whitespace. You can verify the need for the whitespace removal by finding the number of unique values for the column before and after this change. Replace the original Medium column with your cleaned column.

Hints
  • pandas’s string methods are useful here. Remember the .str prefix before calling the method.
  • I prefer to use pandas’s assign method to preserve the original dataframe

c. Fix Problematic Accession Years (10 pts)

There are only a few problematic accession year, and Trifacta should allow you to see these by clicking the red bar in the column. Based on these values, create a rule to extract only the year, making sure to remove the original column and rename the new column AccessionYear.

Hints
  • pandas’s string methods are useful here. Remember the .str prefix before calling the method.

d. Fix Object End Dates (15 pts)

There are certain years where the Object End Date is before the Object Begin Date. Examine these by writing a boolean query. These fields should correspond with the entries in the Object Date field. In some cases, the end date is 0 which may be a placeholder for either an unknown or missing value or the same year. In other cases, it seems like something went wrong with the translation specifically with dates that begin in 664 B.C. (-664). We will try to fix the second case, and set the others to null values to indicate that there is some issue with the object end date. To accomplish this, craft a way to extract the end date from the Object Date column but only when we see a dash (either - or –) and the “B.C.” ending since we know that is where things are problematic. (Consider a capturing regular expression for this.) Now, replace the values in the Object End Date column with such dates. Finally, fill the remaining dates where the end date is before the begin date with NaN values to indicate we don’t know that value.

Hints
  • You can create a boolean series and reuse it in multiple statements.
  • You can use numpy’s np.nan value to fill the problematic cells.

4. Transformation in Pandas

a. Transform Tags (15 pts)

Create a new dataframe with one row per tag and the corresponding Object ID. Here, we can split the Tags column using pandas string methods, and then the explode method to move each of those entries to a new row. Make sure that the Object ID column is still there and correct after running explode.

Hints
  • pandas preserves the index during many operations, or you may be able to explode one particular column.

b. [CSCI 680 Only] Transform Artists (20 pts)

Create a third dataframe that contains artist information split into rows for each of the objects in the collection dataset. This requires less manual effort than in Wrangler because we can loop over the columns. Like Part a, we can split the individual columns into lists. However, pandas will not allow you to explode multiple columns at once; it will allow you to create a new data frame from all of the individual exploded columns. Again, we are interested in four columns: Artist Role, Artist Display Name, Artist Begin Date, and Artist End Date, and need to keep the correspondences between the extracted elements. In other words, the first element in the Artist Role column should be paired with the first element in the Artist Display Name column. The final table should retain Object ID.

Hints

  • Think about whether you can use loops or comprehensions to make your code more robust.

Extra Credit

  • CSCI 490 Students may complete Parts 2b and 4b.
  • All students may work to fix other dates, including Object Begin Date, Artist Begin Date, and Artist End Date.
  • All students may work to find an efficient solution to expand and include all artist columns for Part 4b.