The goal of this assignment is to use pandas to work on data integration and data fusion.
You should work on this assignment on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command line as `jupyter-lab`.
In this assignment, we will be working with data about the artists we have seen in the previous assignments. Various institutions, including the Met, have made data available about the artists in their collections. The goal is to integrate data about different artists in order to understand their different attributes, deal with information that comes from multiple sources, and try strategies for fusing datasets together. We will use datasets from the Metropolitan Museum of Art (Met), the National Gallery of Art (NGA), the Tate, and the Carnegie Museum of Art (CMOA). The datasets are derived from their respective open data sources hosted on GitHub (Met, NGA, Tate, CMOA). Since we are focusing on data integration and data fusion, some of the initial data transformation steps have been done, and the transformed datasets are available from the following URLs:
Attributes should be decipherable, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Monday, April 4.
You should submit the completed notebook file named `a4.ipynb` on Blackboard. Please use Markdown headings to clearly separate your work in the notebook. CS 680 students are responsible for all parts; CS 490 students do not need to do probabilistic record linkage (Part 4c). Please make sure to follow the instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
All of the data is in CSV format. You can read it in via pandas. I recommend downloading the files locally, but you may also read them directly from the URLs above using pandas. Before working on integration, we unsurprisingly have some data cleaning to do. Specifically, we need to fix some of the year information in these datasets. The Met dataset needs the `Artist Begin Date` and `Artist End Date` columns transformed to numeric years, including truncating entries in YYYY-MM-DD format to YYYY. In addition, 9999 values indicate that the artist is still alive and should be changed to NaN. Similarly, the CMOA dataset has `birth_date` and `death_date` columns that should be transformed to YYYY format and converted to numeric values.
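A minimal sketch of this cleaning for the Met dataset, assuming the file has been downloaded locally as `met.csv` (a hypothetical filename); the same helper works for the CMOA columns:

```python
import numpy as np
import pandas as pd

met = pd.read_csv("met.csv")  # hypothetical local filename

def clean_year(col):
    # Keep only the leading year of values like "1865-04-27" or "1865",
    # convert to a numeric type, and treat 9999 (still alive) as missing.
    years = col.astype(str).str.extract(r"^(\d{1,4})")[0]
    return pd.to_numeric(years, errors="coerce").replace(9999, np.nan)

met["Artist Begin Date"] = clean_year(met["Artist Begin Date"])
met["Artist End Date"] = clean_year(met["Artist End Date"])
```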
The CMOA data has information about an artist's places of birth and death, but that is not recorded in the Met data. Add these two columns from the CMOA data to create a new version of the Met dataset with this information. Note that we want to preserve all of the data from the Met's dataset, appending information from CMOA when available (think about the correct type of join/merge). For this section, assume two artists are the same if their "forward" `<first-name> <last-name>` names match. There should be 1171 matches.
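A minimal sketch of this step, assuming the cleaned frames are named `met` and `cmoa` and that the forward-name and place columns have the (hypothetical) names below; check the actual headers in the transformed files:

```python
# A left merge preserves every Met row and fills the CMOA place columns
# with NaN where no forward-name match exists.
places = cmoa[["full_name", "birth_place", "death_place"]].drop_duplicates("full_name")
met_with_places = met.merge(
    places,
    left_on="Artist Display Name",  # assumed Met forward-name column
    right_on="full_name",           # assumed CMOA forward-name column
    how="left",
)
```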
For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each artist:
- `name`: the `<first-name> <last-name>` version of the artist's name
- `cited_name`: the `<last-name>, <first-name>` version of the artist's name
- `birth_year`: the year of the artist's birth (also listed as the "begin" year)
- `death_year`: the year of the artist's death (also listed as the "end" year, NaN if the artist is still alive)
- `nationality`: the nationality of the artist
- `ulan_id`: the ULAN id (only in two datasets)
- `source`: the corresponding data source (`"met"`, `"nga"`, `"tate"`, or `"cmoa"`)

Determine which columns in each dataset map to this common schema. Note that the Tate dataset does not have one of the name versions nor a nationality. The NGA and Met are the only two with ULAN ids (see Part 4a for information about how we will use these; you may transform the Met's `ulan_id` at this step if you wish). Then, transform the four datasets to match this common schema. Keep the datasets separate (for now), but drop all columns except for the seven above.
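A minimal sketch of this transformation for one dataset, assuming the raw Met column names below (verify them against the actual CSV headers); the other three datasets follow the same pattern:

```python
SCHEMA = ["name", "cited_name", "birth_year", "death_year",
          "nationality", "ulan_id", "source"]

met_std = (
    met.rename(columns={
        "Artist Display Name": "name",       # assumed raw Met column names
        "Artist Alpha Sort": "cited_name",
        "Artist Begin Date": "birth_year",
        "Artist End Date": "death_year",
        "Artist Nationality": "nationality",
        "Artist ULAN URL": "ulan_id",
    })
    .assign(source="met")
    .reindex(columns=SCHEMA)  # drops every other column; missing ones become NaN
)
```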
We now wish to identify the same artist in the different datasets. We will use three different approaches, one using known identifiers, another using exact name matches, and the final one using probabilistic record linkage.
Determining which artists are the same across different datasets can be difficult, especially when their names are very similar, or even the same. The Met and the National Gallery of Art (NGA) have both added Getty Union List of Artist Names (ULAN) identifiers to their artist information, and we can use these to tie artists in those two datasets together. This can also serve as a way to match names in the other datasets. The Met dataset will require some string operations to extract just the identifier; for http://vocab.getty.edu/page/ulan/500336597, the ULAN identifier is 500336597. Join items from the two datasets when they have matching ULAN identifiers. Then, output the names that do not exactly match in at least one of the two name versions (`name` and `cited_name`); there should be 1031 of the 3325 matches.
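A minimal sketch, assuming the standardized frames from Part 2 are named `met_std` and `nga_std` and that the Met's `ulan_id` still holds the full vocab.getty.edu URL:

```python
import pandas as pd

# Pull the trailing numeric id out of URLs like
# http://vocab.getty.edu/page/ulan/500336597 -> 500336597.
met_std["ulan_id"] = pd.to_numeric(
    met_std["ulan_id"].astype(str).str.extract(r"(\d+)\s*$")[0],
    errors="coerce",
)

# Drop rows without an id first so NaN keys cannot pair with each other.
ulan_matches = met_std.dropna(subset=["ulan_id"]).merge(
    nga_std.dropna(subset=["ulan_id"]),
    on="ulan_id", suffixes=("_met", "_nga"),
)

# Matched pairs whose names disagree in at least one name version.
mismatched = ulan_matches[
    (ulan_matches["name_met"] != ulan_matches["name_nga"])
    | (ulan_matches["cited_name_met"] != ulan_matches["cited_name_nga"])
]
```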
Now, do merges based on exact name matches, using either the `name` or `cited_name` fields (an exact match on either one is ok).
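A minimal sketch for one pair of datasets, again assuming the standardized frames; a pair found by either merge counts as a match:

```python
# Exact matches on the forward name and on the cited name; the suffixes
# keep the overlapping schema columns from the two sources distinguishable.
on_name = met_std.merge(cmoa_std, on="name", suffixes=("_met", "_cmoa"))
on_cited = met_std.merge(cmoa_std, on="cited_name", suffixes=("_met", "_cmoa"))
```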
Clearly, there is variation in how names are written in the Met and NGA datasets, and we expect this to be the case for the other datasets as well. There are libraries that help with entity resolution; one for Python that uses probabilistic record linkage is fuzzymatcher. You can install it using pip or conda (`pip install fuzzymatcher` or `conda install -c conda-forge fuzzymatcher`). Use it to try to determine similar records by matching pairwise on both the "firstName lastName" and "lastName, firstName" formats. This may take some time to run.
Use fuzzymatcher's `link_table` method. It gives both a rank and a score. We want to take the top-ranked match, but may wish to leave some records unmatched if the score is too low (see the sketch at the end of this part).

Using the matches from the above parts, create a single new dataframe
that contains all of the merged information. There are a number of ways
of putting all of the pairwise matches together. One option is to do
this hierarchically. Another option is to build clusters and make sure
those clusters agree with each other. Keep the original rows, but create
a new unique `artist_id` column that indicates a unique artist. Now this unique id will appear in multiple rows for the artists that appear in different datasets. Your matches should be mostly correct, but do not need to be totally correct if you have used the fuzzymatcher.
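A minimal sketch of the probabilistic linkage and of the clustering option, assuming the (hypothetical) per-source `row_id` columns, a score threshold tuned by inspection, and a `matched_pairs` list collected from Parts 4a-4c:

```python
import fuzzymatcher
import networkx as nx

# Probabilistic record linkage between two standardized frames, matching
# pairwise on both name formats.
links = fuzzymatcher.link_table(
    met_std, cmoa_std,
    left_on=["name", "cited_name"],
    right_on=["name", "cited_name"],
    left_id_col="row_id", right_id_col="row_id",  # hypothetical id columns
)

# Take each left record's top-ranked candidate, but leave records
# unmatched when the score is too low (threshold chosen by inspection).
best = links[(links["match_rank"] == 1) & (links["match_score"] > 0.1)]

# Clustering option: every matched pair from Parts 4a-4c becomes an edge;
# connected components then define one artist_id per real-world artist.
# Rows matched by nothing simply get fresh ids afterwards.
G = nx.Graph()
G.add_edges_from(matched_pairs)  # e.g. tuples like (("met", 12), ("cmoa", 97))
artist_ids = {}
for i, component in enumerate(nx.connected_components(G)):
    for node in component:
        artist_ids[node] = i
```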
At the end of Part 4, you should have a data frame with matching artists identified with the same `artist_id`. Now, we will put all of the datasets together and come up with a single record for each artist. To do this, we need to decide on a common name, decide on birth/death dates, and coalesce nationality.
Review the `aggregate` method in pandas to find methods for choosing either the majority's opinion or the median value. Note that the Tate data has not been updated for several years, and therefore certain artists who have since passed away are listed in that dataset without dates of death. Null values are not included in the median calculation, but nationality values that are empty strings are included.
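A minimal sketch of the fusion step, assuming a combined frame `all_artists` that carries the `artist_id` from Part 4; `majority` is one way to take the most common value per group:

```python
import numpy as np

# Empty strings would otherwise win majority votes, so make them NaN first.
all_artists["nationality"] = all_artists["nationality"].replace("", np.nan)

def majority(s):
    # Most common non-null value; NaN when the group has no values at all.
    s = s.dropna()
    return s.mode().iloc[0] if not s.empty else np.nan

fused = all_artists.groupby("artist_id").agg({
    "name": majority,
    "cited_name": majority,
    "birth_year": "median",  # NaN years (e.g. stale Tate death dates) are skipped
    "death_year": "median",
    "nationality": majority,
})
```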
Examine `nationality` carefully and add at least two transformations that leverage domain or statistical knowledge about particular nationalities to improve the aggregation. It may be useful to see which values appear in some datasets and not in others. Make sure to create a final dataset where each artist appears only once. Again, it is ok if your data is not perfect, but it should correctly use the linking methods and choose the correct aggregate methods.
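As a starting point for that examination, a sketch that compares nationality vocabularies across sources; the replacement mappings below are purely hypothetical, so derive your own from what the comparison reveals:

```python
# Count each nationality value per source; values present in one source
# but absent in another often point to spelling or form differences.
per_source = (all_artists.groupby("source")["nationality"]
              .value_counts()
              .unstack(level=0, fill_value=0))

# Hypothetical normalizations of the kind this comparison can suggest.
all_artists["nationality"] = all_artists["nationality"].replace({
    "USA": "American",
    "Great Britain": "British",
})
```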