Assignment 3

Goals

The goal of this assignment is to use OpenRefine and pandas to perform some data cleaning tasks.

Instructions

Download OpenRefine for part of the assignment. You should work on the other part of the assignment on your own local installation of Jupyter and Python. You should use Python 3.10 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will clean the raw the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. Because this dataset is continually updated and we want to focus on particular issues in the dataset, we will used a specific version and subset of the dataset, available here. Note that both OpenRefine and pandas can read gzip-compressed csv (.csv.gz) files directly from the URL (no need to download or uncompress).

Due Date

The assignment is due at 11:59pm on Friday, March 8.

Submission

You should submit both (a) the completed OpenRefine histories as an a3-part*.json files, and (b) the single completed notebook file named a3.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

In both parts of the assignment, we will be transforming and cleaning the dataset from its raw form a more usable form. The set of changes include:

  • Cleaning the medium column
  • Fixing problematic accession years
  • Fixing problematic object end dates
  • Transforming tags from a delimited string to a new table
  • Transforming artist information columns into a new table

CSCI 490 students are responsible for all tasks except the artist transformations. CSCI 640 students are responsible for all tasks.

1. Cleaning in OpenRefine

Start by downloading OpenRefine. Documentation is available here. For each project (Part 1, Part 2a, and Part 2b), when your project is completed, select the Undo/Redo tab, click the “Extract…” button and click the “Export” button in dialog, saving the operation history as a3-part1.json. Turn in that file as your solution for Part 1.

a. Load Data

After starting OpenRefine, you will see the “Create Project” tab. Note that you can also open an existing project from this screen. You can either download the raw dataset to your computer or use the “Web Addresses (URLs)” source to point to the URL directly. Click “Next” to immediately start working. Then, change any settings if necessary, and click the “Create Project” button.

b. Convert Medium (10 pts)

The Medium column is written with both capitalized words and lowercase words. Standardize this by converting this to all uppercase. Also strip any leading or trailing whitespace. You can verify the need for the whitespace removal by finding the number of unique values for the column before and after this change. Replace the original Medium column with your cleaned column.

c. Fix Problematic Accession Years (10 pts)

There are only a few problematic accession years, and OpenRefine should allow you to see these by creating a facet that looks for those that are not numbers. Think about what regular expression checks if a value is all digits, and then use the match function to extract that string. Now, figure out how you can see if the whole string matches that extracted string. There should be 40 values that violate the constraint. Now, write a custom transform that uses the same digit extraction to extract the year only from the values that have dates to update the AccessionYear columns.

d. Fix Object End Dates (15 pts)

Convert the Object Begin Date and Object End Date columns to numbers. There are certain years where the Object End Date is before the Object Begin Date. You can create a facet to see these, but note that you will need to use the row variables to access more than one column in a comparison. Specifically, consider the custom facet row.cells["Object Begin Date"] > row.cells["Object End Date"]. Looking at those entries for which this is true, we can also examine how the original Object Begin Date and Object End Date fields correspond with the entries in the Object Date field. In some cases, the end date is 0 which may be a placeholder for either an unknown or missing value or the same year. In other cases, it seems like something went wrong with the translation specifically with dates that begin in 664 B.C. (-664). We will try to fix the second case, and set the others to null values to indicate that there is some issue with the object end date. To accomplish this, craft a way to extract the end date from the Object Date column but only when we see a dash (either - or –) and the “B.C.” ending since we know that is where things are problematic. Now, coalesce the new column with the old column, making sure that the new column takes precedence. Finally, fill those remaining values where the Object End Date is less than Object Begin Date with null. Finally, make sure to remove the original column and rename the cleaned column as Object End Date.

e. Save Operation History

Save your operation history for this part as “a3-part1.json” by clicking the Undo/Redo tag, and then “Extract…” followed by “Export”. You can also run your extracted history on the dataset and download the resulting csv file to compare your results with those from Part 3.

2. Transformation in OpenRefine

a. Transform Tags (15 pts)

Create a new project that loads the same file but then deletes all columns but the Object ID and Tags columns. Click the arrow next to the “All” header and choose the remove/rename columns operation. Now, we want drop all rows where the Tags are empty (null), and then split each string in the Tags column into individual rows based on the | delimiter. Then, fill in the Object ID column with the previous entry. Turn in this operation history as “a3-part2a.json”.

Hints
  • To drop rows where the Tags are empty, create a facet for null values and then use that facet to remove matching rows.
  • Don’t split the column into separate columns, but rather generate new rows.

b. [CSCI 640 Only] Transform Artists (20 pts)

Create a third project that extracts artist information and creates a table with the artist information split into rows for each of the objects in the collection dataset. For this part, we are interested in four columns: Artist Role, Artist Display Name, Artist Begin Date, and Artist End Date, plus the Object ID column. Like Part a, we can split the individual columns into arrays. However, in contrast to Part a, we need to keep the correspondences between the extracted elements. In other words, the first element in the Artist Role column should be paired with the first element in the Artist Display Name column, etc. Here we need to construct a rather complex GREL expression or use the Jython extension. To help, consider the following intermediate steps:

  1. Starting example showing each of the four columns on a separate line:
    Author|Publisher|Printer
    Bernardino Baldi|Johannes Praetorius|Ad Insigne Pinus
    1553      |          |1594
    1617      |          |1619
  1. Construct an array of arrays for the four columns
  [ [ "Author", "Publisher", "Printer" ], 
    [ "Bernardino Baldi", "Johannes Praetorius", "Ad Insigne Pinus" ], 
    [ "1553", "", "1594" ], 
    [ "1617", "", "1619" ] ]
  1. “Zip” these columns (no such operation exists natively)
  [ [ "Author", "Bernardino Baldi", "1553", "1617" ],
    [ "Publisher", "Johannes Praetorius", "", "" ], 
    [ "Printer", "Ad Insigne Pinus", "1594", "1619" ] ]
  1. Join the arrays for each artist and then create a new line for each artist (single cell!):
    Author|Bernardino Baldi|1553|1617
    Publisher|Johannes Praetorius||
    Printer|Ad Insigne Pinus|1594|1619

Then, we can split based on the new lines to create new rows, and split on columns to get back to the four original columns. You will need to rename the columns appropriately. Turn in this project’s operation history as “a3-part2b.json”.

Hints

  • You can use forRange to count through arrays (also arrays have a length method)

3. Cleaning in Pandas

We will repeat the operations in Part 1 but now using pandas.

a. Load Data

Use panda’s read_csv method to load the data. Be aware that you may improve type inference if you are able to set low_memory=False. Store the loaded data frame into a variable for use throughout the assignment.

b. Convert Medium (10 pts)

The Medium column is written with both capitalized words and lowercase words. Standardize this by converting this to all uppercase. Also strip any leading or trailing whitespace. You can verify the need for the whitespace removal by finding the number of unique values for the column before and after this change. Replace the original Medium column with your cleaned column.

Hints
  • pandas’s string methods are useful here. Remember the .str prefix before calling the method.
  • I prefer to use pandas’s assign method to preserve the original dataframe

c. Fix Problematic Accession Years (10 pts)

Update the accession year to be only a year as some values show a full date. You can locate the problematic values by searching for those that are not only digits. Update the column so it is an integer and only contains the year.

Hints
  • pandas’s string methods are useful here. Remember the .str prefix before calling the method.

d. Fix Object End Dates (15 pts)

There are certain years where the Object End Date is before the Object Begin Date. Examine these by writing a boolean query. These fields should correspond with the entries in the Object Date field. In some cases, the end date is 0 which may be a placeholder for either an unknown or missing value or the same year. In other cases, it seems like something went wrong with the translation specifically with dates that begin in 664 B.C. (-664). We will try to fix the second case, and set the others to null values to indicate that there is some issue with the object end date. To accomplish this, craft a way to extract the end date from the Object Date column but only when we see a dash (either - or –) and the “B.C.” ending since we know that is where things are problematic. (Consider a capturing regular expression for this.) Now, replace the values in the Object End Date column with such dates. Finally, fill the remaining dates where the end date is before the begin date with NaN values to indicate we don’t know that value.

Hints
  • You can create a boolean series and reuse it in multiple statements.
  • You can use numpy’s np.nan value to fill the problematic cells.

4. Transformation in Pandas

a. Transform Tags (15 pts)

Create a new dataframe with one row per tag and the corresponding Object ID. Here, we can split the Tags column using pandas string methods, and then the explode method to move each of those entries to a new row. Make sure that the Object ID column is still there and correct after running explode.

Hints
  • pandas preserves the index during many operations, or you may be able to explode one particular column.

b. [CSCI 640 Only] Transform Artists (20 pts)

Create a third dataframe that contains artist information split into rows for each of the objects in the collection dataset. This requires less manual effort than in OpenRefine because we can loop over the columns. Like Part a, we can split the individual columns into lists. However, pandas will not allow you to explode multiple columns at once; it will allow you to create a new data frame from all of the individual exploded columns. Again, we are interested in four columns: Artist Role, Artist Display Name, Artist Begin Date, and Artist End Date, and need to keep the correspondences between the extracted elements. In other words, the first element in the Artist Role column should be paired with the first element in the Artist Display Name column. The final table should retain Object ID.

Hints

  • Think about whether you can use loops or comprehensions to make your code more robust.

Extra Credit

  • CSCI 490 Students may complete Parts 2b and 4b.
  • All students may work to fix other dates, including Object Begin Date, Artist Begin Date, and Artist End Date.
  • All students may work to find an efficient solution to expand and include all artist columns for Part 4b.