The goal of this assignment is to use pandas to work on data integration and data fusion.
You should work on the assignment on your own local installation of Jupyter and Python, using Python 3.10 or higher. Anaconda or miniforge are the easiest ways to install and manage Python locally, and you may launch Jupyter Lab either from the Navigator application or via the command line as `jupyter-lab`. If you have trouble loading parquet files, install pyarrow (`conda install pyarrow`).
In this assignment, we will be working with data about the artists we have seen in the previous assignments. Various institutions, including the Met, have made data available about the artists in their collections. The goal is to integrate data about different artists in order to understand their attributes, deal with information that comes from multiple sources, and try strategies for fusing datasets together. We will use datasets from the Metropolitan Museum of Art (Met), the National Gallery of Art (NGA), the Art Institute of Chicago (AIC), and the Cleveland Museum of Art (CMA). The datasets are derived from their respective open data sources hosted on GitHub (Met, NGA, AIC, CMA). Since we are focusing on data integration and data fusion, some of the initial data transformation steps have been done, and those transformed datasets are available from the following URLs:
The second NGA parquet file contains alternative names for the artists, aligned via the `constituentid` attribute. Attributes should be decipherable and should match the provided documentation in most cases, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Friday, April 5. You should submit the completed notebook file named `a4.ipynb` on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
CS 640 students are responsible for all parts; CS 490 students do not need to do probabilistic record linkage (Part 4c). Please make sure to follow the instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
All of the data is in Parquet format. You can read it in via pandas (make sure pyarrow is installed). You may also read the files directly from the URLs above using pandas or download and use them locally. Before working on integration, we unsurprisingly have some data cleaning to do.
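For example, a minimal sketch of loading one file directly from a URL (the URL below is a placeholder for one of the links above):

```python
import pandas as pd

# Placeholder URL; substitute one of the dataset links above
met_url = "https://example.com/met_artists.parquet"

# pandas reads Parquet via pyarrow (or fastparquet) under the hood
met = pd.read_parquet(met_url)
met.head()
```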
We need to fix some of the year information in these datasets. The Met dataset needs the `Artist Begin Date` and `Artist End Date` columns transformed to numeric years, including truncating those entries in `YYYY-MM-DD` format to `YYYY`. In addition, the 9999 values indicate the artist is still alive and should be changed to NaN.
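A minimal sketch of that cleaning step, assuming the Met frame is named `met` (the `to_year` helper is hypothetical):

```python
import numpy as np
import pandas as pd

def to_year(col: pd.Series) -> pd.Series:
    # Keep the leading year of YYYY-MM-DD entries, coercing anything
    # unparseable to NaN
    years = pd.to_numeric(
        col.astype("string").str.extract(r"^(\d{1,4})")[0], errors="coerce"
    )
    # 9999 marks a living artist; represent that as NaN instead
    return years.replace(9999, np.nan)

met["Artist Begin Date"] = to_year(met["Artist Begin Date"])
met["Artist End Date"] = to_year(met["Artist End Date"])
```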
The AIC dataset has a field `artist_display`, which is merged from the artwork table; this is the only location from which we can extract nationality information. The field is a list of entries, and most of the time these entries will be similar. Extract nationality information from this field and create a new column for it.
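One possible sketch, assuming each `artist_display` entry looks roughly like `"Name\nNationality, 1840-1926"` and the AIC frame is named `aic` (both assumptions; the exact layout varies, so treat the pattern as a starting point):

```python
import re
from collections import Counter

def extract_nationality(entries):
    # Missing values may come through as None/NaN rather than a list
    if entries is None or isinstance(entries, float):
        return None
    found = []
    for entry in entries:
        # Look for a capitalized word (possibly hyphenated) just before
        # a comma and a year, e.g. "French, 1840"
        m = re.search(r"([A-Z][A-Za-z\- ]+),\s*(?:c\.\s*)?\d{3,4}", str(entry))
        if m:
            found.append(m.group(1).strip())
    # Majority vote across the (usually similar) entries
    return Counter(found).most_common(1)[0][0] if found else None

aic["nationality"] = aic["artist_display"].apply(extract_nationality)
```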
The CMA dataset puts all information about the artist in a description field, and that information may be replicated more than once (if an artist has multiple works in the museum with different descriptions). We need to extract the name and nationality (if listed) as best we can. It is unlikely you will be able to catch all cases, and that is ok! Don’t spend too much time trying to extract each name correctly.
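A hedged sketch of one way to parse it, assuming descriptions often resemble `"First Last (Nationality, 1881-1973)"` and the column is named `description` (both assumptions):

```python
import re

import pandas as pd

# Hypothetical pattern; many rows will not match, and that is ok
PATTERN = re.compile(r"^(?P<name>[^(]+?)\s*\((?P<nationality>[A-Za-z\- ]+),")

def parse_description(desc):
    if not isinstance(desc, str):
        return pd.Series({"name": None, "nationality": None})
    m = PATTERN.search(desc)
    if m:
        return pd.Series({"name": m.group("name").strip(),
                          "nationality": m.group("nationality").strip()})
    # Fall back to treating the whole string as a name with no nationality
    return pd.Series({"name": desc.strip(), "nationality": None})

cma[["name", "nationality"]] = cma["description"].apply(parse_description)
```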
For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each artist:

- `name`: the `<first-name> <last-name>` version of the artist's name
- `alt_name`: the `<last-name>, <first-name>` version of the artist's name, or another version of the name, e.g. from the `altnames` field
- `birth_year`: the year of the artist's birth (also listed as the "begin" year)
- `death_year`: the year of the artist's death (also listed as the "end" year; NaN if the artist is still alive)
- `nationality`: the nationality of the artist
- `ulan_id`: the ULAN id (only in two datasets)
- `source`: the corresponding data source (`"met"`, `"nga"`, `"aic"`, or `"cma"`)

Determine which columns in each dataset map to this common schema.
While the AIC has a field for ULAN id, there is only one entry with this information. The NGA and Met do have ULAN ids (see Part 4a for information about how we will use these; you may transform the Met's `ulan_id` at this step if you wish). Then, transform the four datasets to match this common schema. Keep the datasets separate (for now), but drop all columns except for the seven above.
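A sketch of the renaming for one dataset follows; the left-hand column names are illustrative, so substitute the actual columns you identified:

```python
COMMON_COLUMNS = ["name", "alt_name", "birth_year", "death_year",
                  "nationality", "ulan_id", "source"]

# Illustrative Met mapping; verify these against the real column names
met_common = met.rename(columns={
    "Artist Display Name": "name",
    "Artist Alpha Sort": "alt_name",
    "Artist Begin Date": "birth_year",
    "Artist End Date": "death_year",
    "Artist Nationality": "nationality",
    "Artist ULAN URL": "ulan_id",
})
met_common["source"] = "met"

# Keep only the seven common-schema columns (missing ones become NaN)
met_common = met_common.reindex(columns=COMMON_COLUMNS)
```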
We now wish to identify the same artist in the different datasets. We will use three different approaches: one using known identifiers, another using exact name matches, and the final one using probabilistic record linkage. We will create a new column (`common_id`) which will be filled in with the same identifier when the entities are the same. You may choose to start with all of the datasets in separate tables or merge them into one table, but by the end, they should be merged into one table.
Determining which artists are the same across different datasets can be difficult, especially when their names are very similar, or even the same. The Met and the National Gallery of Art (NGA) have both added Getty Union List of Artist Names (ULAN) identifiers to their artist information, and we can use these to tie artists in those two datasets together. This can also serve as a way to match names in the other datasets. The Met dataset will require some string operations to extract just the identifier; for `http://vocab.getty.edu/page/ulan/500336597`, the ULAN identifier is `500336597`. Find items from the two datasets that have the same ULAN identifier, and assign them the same `common_id`. (Note that we can exclude these rows from matches between the Met and NGA.)
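A minimal sketch, assuming the common-schema frames are named `met_common` and `nga_common`: extract the digits from the Met's URL, then reuse the ULAN id itself as the `common_id`, which automatically gives rows with the same identifier the same value.

```python
import pandas as pd

# Pull the trailing digits out of URLs like
# http://vocab.getty.edu/page/ulan/500336597
met_common["ulan_id"] = (
    met_common["ulan_id"].astype("string").str.extract(r"ulan/(\d+)")[0]
)

for df in (met_common, nga_common):
    # Normalize to a nullable integer so the ids compare cleanly
    df["ulan_id"] = pd.to_numeric(df["ulan_id"], errors="coerce").astype("Int64")
    has_ulan = df["ulan_id"].notna()
    df.loc[has_ulan, "common_id"] = (
        "ulan_" + df.loc[has_ulan, "ulan_id"].astype(str)
    )
```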
Find common items across all datasets based on exact name matches, using either the `name` or `alt_name` fields (either one with an exact match is ok). Again, set the `common_id` field so that any matches share the same `common_id` value. You may choose to do this in a hierarchical or pairwise way to grow the clusters.
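A minimal sketch using exact `name` matches on a concatenated frame (extending it to also match on `alt_name`, and to merge clusters across both fields, is left as part of the exercise):

```python
import pandas as pd

# Stack the four common-schema frames (names assumed from Part 3)
artists = pd.concat(
    [met_common, nga_common, aic_common, cma_common], ignore_index=True
)

# Rows sharing an exact name get the same provisional id; only fill in
# common_id where the ULAN step has not already set one
name_ids = "name_" + artists.groupby("name").ngroup().astype("Int64").astype(str)
needs_id = artists["common_id"].isna() & artists["name"].notna()
artists.loc[needs_id, "common_id"] = name_ids[needs_id]
```

Note that this simple version does not pull a name-matched row into an existing ULAN-based cluster; a pairwise or union-find pass over the matches handles that case.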
Clearly, there is variation in how names are written in the different datasets. There are libraries that help with entity resolution as well; two for Python that have many features are the Python Record Linkage Toolkit and Splink. Follow the instructions for these toolkits to make sure that matching items are assigned the same `common_id` value. Note that it is possible that you link two artists in the same dataset. Generally, this is unlikely, as the museums have tried to deduplicate their data, but it is possible. You may wish to use different varieties of the names as well as the birth and death years to help block and/or match artists.
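A hedged sketch with the Python Record Linkage Toolkit, assuming the concatenated `artists` frame from Part 4b; the blocking key, comparison rules, and cutoff are illustrative choices, not requirements:

```python
import recordlinkage

# Block on birth_year so only artists born in the same year are compared
indexer = recordlinkage.Index()
indexer.block("birth_year")
candidate_pairs = indexer.index(artists)

# Score candidate pairs on fuzzy name similarity and exact death year
compare = recordlinkage.Compare()
compare.string("name", "name", method="jarowinkler", threshold=0.9, label="name")
compare.exact("death_year", "death_year", label="death_year")
features = compare.compute(candidate_pairs, artists)

# Simple rule-based cut: accept pairs that agree on both features
# (the toolkit also provides probabilistic classifiers, e.g. ECMClassifier)
matches = features[features.sum(axis=1) == 2]
```

From `matches` (an index of row pairs), you still need to propagate a shared `common_id` across each connected cluster of pairs, e.g. with a union-find pass.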
Using the matches from the above parts, create one new dataframe that contains all of the merged information. You may also need to assign individual `common_id`s to those rows that have no matches in the other datasets.
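For instance, a small sketch of assigning unique ids to the leftover rows (assuming the merged frame is named `artists`):

```python
# Give every still-unmatched row its own common_id
unmatched = artists["common_id"].isna()
artists.loc[unmatched, "common_id"] = [f"solo_{i}" for i in range(unmatched.sum())]
```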
At the end of Part 4, you should have a data frame with matching artists identified with the same `common_id`. Now, we will put all of the datasets together and come up with a single record for each artist. To do this, we need to decide on a common name, decide on birth/death dates, and coalesce nationality.
Review the `aggregate` method in pandas to find methods for choosing either the majority's opinion or the median value for the name, birth date, and death date fields. Remember that if a dataset has no information on a particular value (e.g. death date), this could be something we wish to fill in from the other datasets. However, it is possible that one of the datasets has incorrect information.
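A sketch of the fusion step using `groupby`/`agg`, with a hypothetical `majority` helper for the name:

```python
import pandas as pd

def majority(s: pd.Series):
    """Most common non-null value in the group, or None if all null."""
    s = s.dropna()
    return s.mode().iloc[0] if not s.empty else None

# One record per artist: majority vote for the name, median for the years.
# median skips NaN, so a missing year in one source is filled from the others.
fused = artists.groupby("common_id").agg(
    name=("name", majority),
    birth_year=("birth_year", "median"),
    death_year=("death_year", "median"),
)
```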
Examine `nationality` carefully and add at least two transformations that leverage domain or statistical knowledge about particular nationalities to improve the aggregation. It may be useful to see which values appear in some datasets and not in others. Make sure to create a final dataset where each artist appears only once. Again, it is ok if your data is not perfect, but it should correctly use the linking methods and choose the correct aggregate methods.
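For example, a hedged sketch of one kind of normalization (the specific mappings are illustrative; base yours on what you actually observe in the data, e.g. via `value_counts()` per source):

```python
import pandas as pd

# Illustrative mappings only; inspect the data to choose real transformations
NATIONALITY_MAP = {
    "USA": "American",
    "United States": "American",
    "Holland": "Dutch",
}

artists["nationality"] = artists["nationality"].replace(NATIONALITY_MAP)

# See which nationality values appear in some sources but not others
pd.crosstab(artists["nationality"], artists["source"])
```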