The goal of this assignment is to use Trifacta Wrangler and pandas to perform some data cleaning tasks.
You will need to create a free account for Trifacta Wrangler
and use the hosted environment there for part of the assignment. You
should work on the other part of the assignment on your own local
installation of Jupyter and Python. You should use Python 3.8 or higher
for your work. When working locally, Anaconda is the easiest
way to install and manage Python, and you may launch Jupyter Lab either
from the Navigator application or via the command-line as
jupyter-lab
.
In this assignment, we will clean the raw the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. Because this dataset is continually updated and we want to focus on particular issues in the dataset, we will used a specific version and subset of the dataset, available here. Note that both Wrangler and pandas can read gzip-compressed csv (.csv.gz) files without you needing to first uncompress them.
The assignment is due at 11:59pm on Friday, March 11.
You should submit both (a) the completed Trifacta recipes as an
a3-part*.wrangle
files, and (b) the single completed
notebook file named a3.ipynb
on Blackboard. Please use Markdown
headings to clearly separate your work in the notebook.
In both parts of the assignment, we will be transforming and cleaning the dataset from its raw form a more usable form. The set of changes include:
CSCI 490 students are responsible for all tasks except the artist transformations. CSCI 680 students are responsible for all tasks.
Start by signing up for a free Wrangler account; click the “Start Wrangler” link here. Please note that Wrangler requires the Chrome browser. Documentation is available here. For each recipe, when your recipe is completed, show the Recipe, select the gears icon, and select “Download Recipe as wrangle”. Turn in that file as your solution for Part 1.
Begin by downloading the raw dataset to your computer. After signing in to Wrangler, click the “Import Data” button on the upper-right side of the home screen, and import the dataset you downloaded. You can choose the “Import and Wrangle” button to immediately start working. If you choose “Import” instead or come back to this later, you can go to the flows icon and create a new flow, selecting the dataset you have imported.
The Medium
column is written with both capitalized words
and lowercase words. Standardize this by converting this to all
uppercase. Also strip any leading or trailing whitespace. You can verify
the need for the whitespace removal by finding the number of unique
values for the column before and after this change. Replace the original
Medium
column with your cleaned column.
There are only a few problematic accession year, and Trifacta should
allow you to see these by clicking the red bar in the column. Based on
these values, create a rule to extract only the year, making sure to
remove the original column and rename the new column
AccessionYear
.
There are certain years where the Object End Date
is
before the Object Begin Date
. These fields should
correspond with the entries in the Object Date
field. In
some cases, the end date is 0 which may be a placeholder for either an
unknown or missing value or the same year. In other cases, it seems like
something went wrong with the translation specifically with dates that
begin in 664 B.C. (-664). We will try to fix the second case, and set
the others to null values to indicate that there is some issue with the
object end date. To accomplish this, craft a way to extract the end date
from the Object Date
column but only when we see a dash
(either - or –) and the “B.C.” ending since we know that is where things
are problematic. Now, coalesce the new column with the old column,
making sure that the new column takes precedence. Finally, fill those
remaining values where the Object End Date
is less than
Object Begin Date
with null. Finally, make sure to remove
the original column and rename the cleaned column as
Object End Date
.
Save your recipe for this part as “a3-part1.wrangle” by clicking the recipe view, and then the gear icon and selecting “Download Recipe as wrangle”. You can also run your recipe on the dataset and download the resulting csv file to compare your results with those from Part 3.
Create a third recipe that extracts artist information and creates a
table with the artist information split into rows for each of the
objects in the collection dataset. For this part, we are interested in
four columns: Artist Role
,
Artist Display Name
, Artist Begin Date
, and
Artist End Date
. Like Part a, we can split the individual
columns into arrays. However, in contrast to Part a, we need to keep the
correspondences between the extracted elements. In other words, the
first element in the Artist Role
column should be paired
with the first element in the Artist Display Name
column.
The final table should retain Object ID
. Here, the
ARRAYZIP
function (search for transformation) will be
useful. In addition, we can extract elements to columns, but then we
need to split the arrays back into individual columns again and rename
them appropriately. Turn in this recipe as “a3-part2b.wrangle”.
[0]
);
you can do multiple of these extractions in one step, though.We will repeat the operations in Part 1 but now using pandas.
Use panda’s read_csv
method to load the data. Be aware
that you may improve type inference if you are able to set
low_memory=False
. Store the loaded data frame into a
variable for use throughout the assignment.
The Medium
column is written with both capitalized words
and lowercase words. Standardize this by converting this to all
uppercase. Also strip any leading or trailing whitespace. You can verify
the need for the whitespace removal by finding the number of unique
values for the column before and after this change. Replace the original
Medium
column with your cleaned column.
.str
prefix
before calling the method.There are only a few problematic accession year, and Trifacta should
allow you to see these by clicking the red bar in the column. Based on
these values, create a rule to extract only the year, making sure to
remove the original column and rename the new column
AccessionYear
.
.str
prefix
before calling the method.There are certain years where the Object End Date
is
before the Object Begin Date
. Examine these by writing a
boolean query. These fields should correspond with the entries in the
Object Date
field. In some cases, the end date is 0 which
may be a placeholder for either an unknown or missing value or the same
year. In other cases, it seems like something went wrong with the
translation specifically with dates that begin in 664 B.C. (-664). We
will try to fix the second case, and set the others to null values to
indicate that there is some issue with the object end date. To
accomplish this, craft a way to extract the end date from the
Object Date
column but only when we see a dash (either - or
–) and the “B.C.” ending since we know that is where things are
problematic. (Consider a capturing regular expression for this.) Now,
replace the values in the Object End Date
column with such
dates. Finally, fill the remaining dates where the end date is before
the begin date with NaN
values to indicate we don’t know
that value.
np.nan
value to fill the
problematic cells.Create a third dataframe that contains artist information split into
rows for each of the objects in the collection dataset. This requires
less manual effort than in Wrangler because we can loop over the
columns. Like Part a, we can split the individual columns into lists.
However, pandas will not allow you to explode multiple columns at once;
it will allow you to create a new data frame from all of the individual
exploded columns. Again, we are interested in four columns:
Artist Role
, Artist Display Name
,
Artist Begin Date
, and Artist End Date
, and
need to keep the correspondences between the extracted elements. In
other words, the first element in the Artist Role
column
should be paired with the first element in the
Artist Display Name
column. The final table should retain
Object ID
.
Object Begin Date
, Artist Begin Date
, and
Artist End Date
.