The goal of this assignment is to use pandas to work on data integration and data fusion.
You should work on the assignment on your own local installation of Jupyter and Python, using Python 3.10 or higher. Anaconda or miniforge are the easiest ways to install and manage Python locally, and you may launch Jupyter Lab either from the Navigator application or via the command line as `jupyter-lab`. If you have trouble loading parquet files, install pyarrow (`conda install pyarrow`).
In this assignment, we will be working with data about the artists we have seen in the previous assignments. Various institutions, including the Met, have made data available about the artists in their collections. The goal is to integrate data about different artists in order to understand their attributes, deal with information that comes from multiple sources, and try strategies for fusing datasets together. We will use datasets from the Metropolitan Museum of Art (Met), the National Gallery of Art (NGA), the Art Institute of Chicago (AIC), and the Cleveland Museum of Art (CMA). The datasets are derived from their respective open data sources hosted on GitHub (Met, NGA, AIC, CMA). Since we are focusing on data integration and data fusion, some of the initial data transformation steps have been done, and those transformed datasets are available from the following URLs:
The second NGA parquet file contains alternative names for the artists, aligned via the `constituentid` attribute. Attributes should be decipherable and should match the provided documentation in most cases, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Friday, April 5. You should submit the completed notebook file named `a4.ipynb` on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
CS 640 students are responsible for all parts; CS 490 students do not need to do probabilistic record linkage (Part 4c). Please make sure to follow the instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
All of the data is in Parquet format. You can read it in via pandas (make sure pyarrow is installed). You may also read the files directly from the URLs above using pandas or download and use them locally. Before working on integration, we unsurprisingly have some data cleaning to do.
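For example, a minimal sketch of loading one file directly from a URL (the URL below is a placeholder for one of the links above):

```python
import pandas as pd

# Placeholder URL; substitute one of the dataset links above
met_url = "https://example.com/met_artists.parquet"

# pandas reads Parquet via pyarrow (or fastparquet) under the hood
met = pd.read_parquet(met_url)
met.head()
```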
We need to fix some of the year information in these datasets. The Met dataset needs the `Artist Begin Date` and `Artist End Date` columns transformed to numeric years, including truncating those entries in `YYYY-MM-DD` format to `YYYY`. In addition, the 9999 values indicate the artist is still alive and should be changed to NaN.
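A minimal sketch of that cleaning step, assuming the Met frame is named `met` (the `to_year` helper is hypothetical):

```python
import numpy as np
import pandas as pd

def to_year(col: pd.Series) -> pd.Series:
    # Keep the leading year of YYYY-MM-DD entries, coercing anything
    # unparseable to NaN
    years = pd.to_numeric(
        col.astype("string").str.extract(r"^(\d{1,4})")[0], errors="coerce"
    )
    # 9999 marks a living artist; represent that as NaN instead
    return years.replace(9999, np.nan)

met["Artist Begin Date"] = to_year(met["Artist Begin Date"])
met["Artist End Date"] = to_year(met["Artist End Date"])
```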
The AIC dataset has a field `artist_display`, which is merged from the artwork table; this is the only location from which we can extract nationality information. The field is a list of entries, and most of the time these entries will be similar. Extract nationality information from this field and create a new column for it.
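One possible sketch, assuming each `artist_display` entry looks roughly like `"Name\nNationality, 1840-1926"` and the AIC frame is named `aic` (both assumptions; the exact layout varies, so treat the pattern as a starting point):

```python
import re
from collections import Counter

def extract_nationality(entries):
    # Missing values may come through as None/NaN rather than a list
    if entries is None or isinstance(entries, float):
        return None
    found = []
    for entry in entries:
        # Look for a capitalized word (possibly hyphenated) just before
        # a comma and a year, e.g. "French, 1840"
        m = re.search(r"([A-Z][A-Za-z\- ]+),\s*(?:c\.\s*)?\d{3,4}", str(entry))
        if m:
            found.append(m.group(1).strip())
    # Majority vote across the (usually similar) entries
    return Counter(found).most_common(1)[0][0] if found else None

aic["nationality"] = aic["artist_display"].apply(extract_nationality)
```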
The CMA dataset puts all information about the artist in a description field, and that information may be replicated more than once (if an artist has multiple works in the museum with different descriptions). We need to extract the name and nationality (if listed) as best we can. It is unlikely you will be able to catch all cases, and that is ok! Don’t spend too much time trying to extract each name correctly.
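A hedged sketch of one way to parse it, assuming descriptions often resemble `"First Last (Nationality, 1881-1973)"` and the column is named `description` (both assumptions):

```python
import re

import pandas as pd

# Hypothetical pattern; many rows will not match, and that is ok
PATTERN = re.compile(r"^(?P<name>[^(]+?)\s*\((?P<nationality>[A-Za-z\- ]+),")

def parse_description(desc):
    if not isinstance(desc, str):
        return pd.Series({"name": None, "nationality": None})
    m = PATTERN.search(desc)
    if m:
        return pd.Series({"name": m.group("name").strip(),
                          "nationality": m.group("nationality").strip()})
    # Fall back to treating the whole string as a name with no nationality
    return pd.Series({"name": desc.strip(), "nationality": None})

cma[["name", "nationality"]] = cma["description"].apply(parse_description)
```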
For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each artist:

- `name`: the `<first-name> <last-name>` version of the artist's name
- `alt_name`: the `<last-name>, <first-name>` version of the artist's name, or another version of the name, e.g. from the `altnames` field
- `birth_year`: the year of the artist's birth (also listed as the "begin" year)
- `death_year`: the year of the artist's death (also listed as the "end" year; NaN if the artist is still alive)
- `nationality`: the nationality of the artist
- `ulan_id`: the ULAN id (only in two datasets)
- `source`: the corresponding data source (`"met"`, `"nga"`, `"aic"`, or `"cma"`)

Determine which columns in each dataset map to this common schema.
While the AIC has a field for ULAN id, there is only one entry with this information. The NGA and Met do have ULAN ids (see Part 4a for information about how we will use these; you may transform the Met's `ulan_id` at this step if you wish). Then, transform the four datasets to match this common schema. Keep the datasets separate (for now), but drop all columns except for the seven above.
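A sketch of the renaming for one dataset follows; the left-hand column names are illustrative, so substitute the actual columns you identified:

```python
COMMON_COLUMNS = ["name", "alt_name", "birth_year", "death_year",
                  "nationality", "ulan_id", "source"]

# Illustrative Met mapping; verify these against the real column names
met_common = met.rename(columns={
    "Artist Display Name": "name",
    "Artist Alpha Sort": "alt_name",
    "Artist Begin Date": "birth_year",
    "Artist End Date": "death_year",
    "Artist Nationality": "nationality",
    "Artist ULAN URL": "ulan_id",
})
met_common["source"] = "met"

# Keep only the seven common-schema columns (missing ones become NaN)
met_common = met_common.reindex(columns=COMMON_COLUMNS)
```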
We now wish to identify the same artist in the different datasets. We will use three different approaches: one using known identifiers, another using exact name matches, and the final one using probabilistic record linkage. We will create a new column (`common_id`) which will be filled in with the same identifier when the entities are the same. You may choose to start with all of the datasets in separate tables or merge them into one table, but by the end, they should be merged into one table.
Determining which artists are the same across different datasets can be difficult, especially when their names are very similar, or even the same. The Met and the National Gallery of Art (NGA) have both added Getty Union List of Artist Names (ULAN) identifiers to their artist information, and we can use these to tie artists in those two datasets together. This can also serve as a way to match names in the other datasets. The Met dataset will require some string operations to extract just the identifier; for `http://vocab.getty.edu/page/ulan/500336597`, the ULAN identifier is `500336597`. Find items from the two datasets that have the same ULAN identifier, and assign them the same `common_id`. (Note that we can exclude these rows from matches between the Met and NGA.)
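A minimal sketch, assuming the common-schema frames are named `met_common` and `nga_common`: extract the digits from the Met's URL, then reuse the ULAN id itself as the `common_id`, which automatically gives rows with the same identifier the same value.

```python
import pandas as pd

# Pull the trailing digits out of URLs like
# http://vocab.getty.edu/page/ulan/500336597
met_common["ulan_id"] = (
    met_common["ulan_id"].astype("string").str.extract(r"ulan/(\d+)")[0]
)

for df in (met_common, nga_common):
    # Normalize to a nullable integer so the ids compare cleanly
    df["ulan_id"] = pd.to_numeric(df["ulan_id"], errors="coerce").astype("Int64")
    has_ulan = df["ulan_id"].notna()
    df.loc[has_ulan, "common_id"] = (
        "ulan_" + df.loc[has_ulan, "ulan_id"].astype(str)
    )
```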
Find common items across all datasets based on exact name matches, using either the `name` or `alt_name` fields (either one with an exact match is ok). Again, set the `common_id` field so that any matches share the same `common_id` value. You may choose to do this in a hierarchical or pairwise way to grow the clusters.
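A minimal sketch using exact `name` matches on a concatenated frame (extending it to also match on `alt_name`, and to merge clusters across both fields, is left as part of the exercise):

```python
import pandas as pd

# Stack the four common-schema frames (names assumed from Part 3)
artists = pd.concat(
    [met_common, nga_common, aic_common, cma_common], ignore_index=True
)

# Rows sharing an exact name get the same provisional id; only fill in
# common_id where the ULAN step has not already set one
name_ids = "name_" + artists.groupby("name").ngroup().astype("Int64").astype(str)
needs_id = artists["common_id"].isna() & artists["name"].notna()
artists.loc[needs_id, "common_id"] = name_ids[needs_id]
```

Note that this simple version does not pull a name-matched row into an existing ULAN-based cluster; a pairwise or union-find pass over the matches handles that case.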
Clearly, there is variation in how names are written in the different datasets. There are libraries that help with entity resolution as well; two for Python that have many features are the Python Record Linkage Toolkit and Splink. Follow the instructions for these toolkits to make sure that matching items are assigned the same `common_id` value. Note that it is possible that you link two artists in the same dataset. Generally, this is unlikely, as the museums have tried to deduplicate their data, but it is possible. You may wish to use different varieties of the names as well as the birth and death years to help block and/or match artists.
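A hedged sketch with the Python Record Linkage Toolkit, assuming the concatenated `artists` frame from Part 4b; the blocking key, comparison rules, and cutoff are illustrative choices, not requirements:

```python
import recordlinkage

# Block on birth_year so only artists born in the same year are compared
indexer = recordlinkage.Index()
indexer.block("birth_year")
candidate_pairs = indexer.index(artists)

# Score candidate pairs on fuzzy name similarity and exact death year
compare = recordlinkage.Compare()
compare.string("name", "name", method="jarowinkler", threshold=0.9, label="name")
compare.exact("death_year", "death_year", label="death_year")
features = compare.compute(candidate_pairs, artists)

# Simple rule-based cut: accept pairs that agree on both features
# (the toolkit also provides probabilistic classifiers, e.g. ECMClassifier)
matches = features[features.sum(axis=1) == 2]
```

From `matches` (an index of row pairs), you still need to propagate a shared `common_id` across each connected cluster of pairs, e.g. with a union-find pass.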
Using the matches from the above parts, create one new dataframe that contains all of the merged information. You may also need to assign individual `common_id`s to those rows that have no matches in the other datasets.
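For instance, a small sketch of assigning unique ids to the leftover rows (assuming the merged frame is named `artists`):

```python
# Give every still-unmatched row its own common_id
unmatched = artists["common_id"].isna()
artists.loc[unmatched, "common_id"] = [f"solo_{i}" for i in range(unmatched.sum())]
```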
At the end of Part 4, you should have a data frame with matching artists identified with the same `common_id`. Now, we will put all of the datasets together and come up with a single record for each artist. To do this, we need to decide on a common name, decide on birth/death dates, and coalesce nationality.
Review the `aggregate` method in pandas to find methods for choosing either the majority's opinion or the median value for the name, birth date, and death date fields. Remember that if a dataset has no information on a particular value (e.g. death date), this could be something we wish to fill in from the other datasets. However, it is possible that one of the datasets has incorrect information.
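A sketch of the fusion step using `groupby`/`agg`, with a hypothetical `majority` helper for the name:

```python
import pandas as pd

def majority(s: pd.Series):
    """Most common non-null value in the group, or None if all null."""
    s = s.dropna()
    return s.mode().iloc[0] if not s.empty else None

# One record per artist: majority vote for the name, median for the years.
# median skips NaN, so a missing year in one source is filled from the others.
fused = artists.groupby("common_id").agg(
    name=("name", majority),
    birth_year=("birth_year", "median"),
    death_year=("death_year", "median"),
)
```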
Examine `nationality` carefully and add at least two transformations that leverage domain or statistical knowledge about particular nationalities to improve the aggregation. It may be useful to see which values appear in some datasets and not in others. Make sure to create a final dataset where each artist appears only once. Again, it is ok if your data is not perfect, but it should correctly use the linking methods and choose the correct aggregate methods.
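For example, a hedged sketch of one kind of normalization (the specific mappings are illustrative; base yours on what you actually observe in the data, e.g. via `value_counts()` per source):

```python
import pandas as pd

# Illustrative mappings only; inspect the data to choose real transformations
NATIONALITY_MAP = {
    "USA": "American",
    "United States": "American",
    "Holland": "Dutch",
}

artists["nationality"] = artists["nationality"].replace(NATIONALITY_MAP)

# See which nationality values appear in some sources but not others
pd.crosstab(artists["nationality"], artists["source"])
```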