The goal of this assignment is to use OpenRefine and pandas to perform some data cleaning tasks.
Download OpenRefine for part of
the assignment. You should work on the other part of the assignment on
your own local installation of Jupyter and Python. You should use Python
3.10 or higher for your work. When working locally, Anaconda is the easiest
way to install and manage Python, and you may launch Jupyter Lab either
from the Navigator application or via the command-line as
jupyter-lab
.
In this assignment, we will clean the raw the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. Because this dataset is continually updated and we want to focus on particular issues in the dataset, we will used a specific version and subset of the dataset, available here. Note that both OpenRefine and pandas can read gzip-compressed csv (.csv.gz) files directly from the URL (no need to download or uncompress).
The assignment is due at 11:59pm on Friday, March 8.
You should submit both (a) the completed OpenRefine histories as an
a3-part*.json
files, and (b) the single completed notebook
file named a3.ipynb
on Blackboard. Please use Markdown
headings to clearly separate your work in the notebook.
In both parts of the assignment, we will be transforming and cleaning the dataset from its raw form a more usable form. The set of changes include:
CSCI 490 students are responsible for all tasks except the artist transformations. CSCI 640 students are responsible for all tasks.
Start by downloading OpenRefine.
Documentation is available here. For each project (Part 1,
Part 2a, and Part 2b), when your project is completed, select the
Undo/Redo tab, click the “Extract…” button and click the “Export” button
in dialog, saving the operation history as a3-part1.json
.
Turn in that file as your solution for Part 1.
After starting OpenRefine, you will see the “Create Project” tab. Note that you can also open an existing project from this screen. You can either download the raw dataset to your computer or use the “Web Addresses (URLs)” source to point to the URL directly. Click “Next” to immediately start working. Then, change any settings if necessary, and click the “Create Project” button.
The Medium
column is written with both capitalized words
and lowercase words. Standardize this by converting this to all
uppercase. Also strip any leading or trailing whitespace. You can verify
the need for the whitespace removal by finding the number of unique
values for the column before and after this change. Replace the original
Medium
column with your cleaned column.
There are only a few problematic accession years, and OpenRefine
should allow you to see these by creating a facet that looks for those
that are not numbers. Think about what regular expression checks if a
value is all digits, and then use the match
function to extract that string. Now, figure out how you can see if
the whole string matches that extracted string. There should be
40 values that violate the constraint. Now, write a
custom transform that uses the same digit extraction to extract the year
only from the values that have dates to update the
AccessionYear
columns.
Convert the Object Begin Date
and
Object End Date
columns to numbers. There are certain years
where the Object End Date
is before the
Object Begin Date
. You can create a facet to see these, but
note that you will need to use the row
variables to access
more than one column in a comparison. Specifically, consider the custom
facet
row.cells["Object Begin Date"] > row.cells["Object End Date"]
.
Looking at those entries for which this is true, we can also examine how
the original Object Begin Date
and
Object End Date
fields correspond with the entries in the
Object Date
field. In some cases, the end date is 0 which
may be a placeholder for either an unknown or missing value or the same
year. In other cases, it seems like something went wrong with the
translation specifically with dates that begin in 664 B.C. (-664). We
will try to fix the second case, and set the others to null values to
indicate that there is some issue with the object end date. To
accomplish this, craft a way to extract the end date from the
Object Date
column but only when we see a dash (either - or
–) and the “B.C.” ending since we know that is where things are
problematic. Now, coalesce the new column with the old column, making
sure that the new column takes precedence. Finally, fill those remaining
values where the Object End Date
is less than
Object Begin Date
with null. Finally, make sure to remove
the original column and rename the cleaned column as
Object End Date
.
Save your operation history for this part as “a3-part1.json” by clicking the Undo/Redo tag, and then “Extract…” followed by “Export”. You can also run your extracted history on the dataset and download the resulting csv file to compare your results with those from Part 3.
Create a third project that extracts artist information and creates a
table with the artist information split into rows for each of the
objects in the collection dataset. For this part, we are interested in
four columns: Artist Role
,
Artist Display Name
, Artist Begin Date
, and
Artist End Date
, plus the Object ID
column.
Like Part a, we can split the individual columns into arrays. However,
in contrast to Part a, we need to keep the correspondences between the
extracted elements. In other words, the first element in the
Artist Role
column should be paired with the first element
in the Artist Display Name
column, etc. Here we need to
construct a rather complex GREL expression or
use the Jython extension. To help, consider the following intermediate
steps:
Author|Publisher|Printer
Bernardino Baldi|Johannes Praetorius|Ad Insigne Pinus
1553 | |1594
1617 | |1619
[ [ "Author", "Publisher", "Printer" ],
[ "Bernardino Baldi", "Johannes Praetorius", "Ad Insigne Pinus" ],
[ "1553", "", "1594" ],
[ "1617", "", "1619" ] ]
[ [ "Author", "Bernardino Baldi", "1553", "1617" ],
[ "Publisher", "Johannes Praetorius", "", "" ],
[ "Printer", "Ad Insigne Pinus", "1594", "1619" ] ]
Author|Bernardino Baldi|1553|1617
Publisher|Johannes Praetorius||
Printer|Ad Insigne Pinus|1594|1619
Then, we can split based on the new lines to create new rows, and split on columns to get back to the four original columns. You will need to rename the columns appropriately. Turn in this project’s operation history as “a3-part2b.json”.
forRange
to count through arrays (also
arrays have a length
method)We will repeat the operations in Part 1 but now using pandas.
Use panda’s read_csv
method to load the data. Be aware
that you may improve type inference if you are able to set
low_memory=False
. Store the loaded data frame into a
variable for use throughout the assignment.
The Medium
column is written with both capitalized words
and lowercase words. Standardize this by converting this to all
uppercase. Also strip any leading or trailing whitespace. You can verify
the need for the whitespace removal by finding the number of unique
values for the column before and after this change. Replace the original
Medium
column with your cleaned column.
.str
prefix
before calling the method.Update the accession year to be only a year as some values show a full date. You can locate the problematic values by searching for those that are not only digits. Update the column so it is an integer and only contains the year.
.str
prefix
before calling the method.There are certain years where the Object End Date
is
before the Object Begin Date
. Examine these by writing a
boolean query. These fields should correspond with the entries in the
Object Date
field. In some cases, the end date is 0 which
may be a placeholder for either an unknown or missing value or the same
year. In other cases, it seems like something went wrong with the
translation specifically with dates that begin in 664 B.C. (-664). We
will try to fix the second case, and set the others to null values to
indicate that there is some issue with the object end date. To
accomplish this, craft a way to extract the end date from the
Object Date
column but only when we see a dash (either - or
–) and the “B.C.” ending since we know that is where things are
problematic. (Consider a capturing regular expression for this.) Now,
replace the values in the Object End Date
column with such
dates. Finally, fill the remaining dates where the end date is before
the begin date with NaN
values to indicate we don’t know
that value.
np.nan
value to fill the
problematic cells.Create a third dataframe that contains artist information split into
rows for each of the objects in the collection dataset. This requires
less manual effort than in OpenRefine because we can loop over the
columns. Like Part a, we can split the individual columns into lists.
However, pandas will not allow you to explode multiple columns at once;
it will allow you to create a new data frame from all of the individual
exploded columns. Again, we are interested in four columns:
Artist Role
, Artist Display Name
,
Artist Begin Date
, and Artist End Date
, and
need to keep the correspondences between the extracted elements. In
other words, the first element in the Artist Role
column
should be paired with the first element in the
Artist Display Name
column. The final table should retain
Object ID
.
Object Begin Date
, Artist Begin Date
, and
Artist End Date
.