Assignment 4

Goals

The goal of this assignment is to use pandas to work on data integration and data fusion.

Instructions

You should work on this assignment on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command line as jupyter-lab.

In this assignment, we will be working with data about the artists we have seen in the previous assignments. Various institutions, including the Met, have made data available about the artists in their collections. The goal is to integrate data about artists from multiple sources in order to understand their attributes, reconcile conflicting information, and try strategies for fusing datasets together. We will use datasets from the Metropolitan Museum of Art (Met), the National Gallery of Art (NGA), the Tate, and the Carnegie Museum of Art (CMOA). The datasets are derived from their respective open data sources hosted on GitHub (Met, NGA, Tate, CMOA). Since we are focusing on data integration and data fusion, some of the initial data transformation steps have been done, and those transformed datasets are available from the following URLs:

Attributes should be decipherable, but please ask if you have any questions about the columns.

Due Date

The assignment is due at 11:59pm on Monday, April 4.

Submission

You should submit the completed notebook file named a4.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

CS 680 students are responsible for all parts; CS 490 students do not need to do probabilistic record linking (Part 4c). Please make sure to follow instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.

1. Reading & Cleaning Data (15 pts)

All of the data is in CSV format. You can read it in via pandas. I recommend downloading the files locally, but you may also read them directly from the URLs above using pandas. Before working on integration, we unsurprisingly have some data cleaning to do. Specifically, we need to fix some of the year information in these datasets. The Met dataset needs the Artist Begin Date and Artist End Date columns transformed to numeric years, including truncating those entries with YYYY-MM-DD format to YYYY. In addition, the 9999 values indicate the artist is still alive and should be changed to NaN. Similarly, the CMOA dataset has birth_date and death_date columns that should be transformed to YYYY format and converted to a numeric value.
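The year cleaning described above can be sketched as follows. This is a minimal example using a toy stand-in for the Met data (the real files have many more columns and rows):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Met artist data, using the column names from the assignment.
met = pd.DataFrame({
    "Artist Begin Date": ["1840", "1881-10-25", "1990"],
    "Artist End Date": ["1926", "1973-04-08", "9999"],
})

def to_year(col):
    # Truncate YYYY-MM-DD entries to YYYY, then convert to a numeric year.
    years = pd.to_numeric(col.astype(str).str.slice(0, 4), errors="coerce")
    # 9999 marks a still-living artist; replace it with NaN.
    return years.replace(9999, np.nan)

met["Artist Begin Date"] = to_year(met["Artist Begin Date"])
met["Artist End Date"] = to_year(met["Artist End Date"])
```

The same `to_year` helper can be applied to the CMOA `birth_date` and `death_date` columns.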

2. Data Augmentation (10 pts)

The CMOA data has information about an artist’s places of birth and death, but that is not recorded in the Met data. Add these two columns from the CMOA to create a new version of the Met dataset with this information. Note that we want to preserve all of the data from the Met’s dataset, appending information from CMOA when available (think about the correct type of join/merge). For this section, assume the artists are the same if their “forward” <first-name> <last-name> names match. There should be 1171 matches.
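Preserving all Met rows while appending CMOA information is what a left merge does. A minimal sketch, with hypothetical column names (check the actual files for the real name columns):

```python
import pandas as pd

# Toy stand-ins; column names here are illustrative, not the real ones.
met = pd.DataFrame({"Artist Display Name": ["Claude Monet", "Unknown Artist"]})
cmoa = pd.DataFrame({
    "full_name": ["Claude Monet"],
    "birth_place": ["Paris, France"],
    "death_place": ["Giverny, France"],
})

# A left merge keeps every Met row and fills in CMOA places where names match.
augmented = met.merge(
    cmoa[["full_name", "birth_place", "death_place"]],
    how="left",
    left_on="Artist Display Name",
    right_on="full_name",
).drop(columns="full_name")
```

Rows with no CMOA match simply get NaN in the two new columns.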

3. Schema Matching (15 pts)

For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each artist:

  • name: the <first-name> <last-name> version of the artist’s name
  • cited_name: the <last-name>, <first-name> version of the artist’s name
  • birth_year: the year of the artist’s birth (also listed as the “begin” year)
  • death_year: the year of the artist’s death (also listed as the “end” year, NaN if the artist is still alive)
  • nationality: the nationality of the artist
  • ulan_id: the ULAN id (only in two datasets)
  • source: the corresponding data source ("met", "nga", "tate", "cmoa")

Determine which columns in each dataset map to this common schema. Note that the Tate dataset does not have one of the name versions nor a nationality. The NGA and Met are the only two with ULAN ids (See Part 4a for information about how we will use these; you may transform the Met’s ulan_id at this step if you wish). Then, transform the four datasets to match this common schema. Keep the datasets separate (for now), but drop all columns except for the seven above.

Hints
  • Be careful about identifying how columns map to the common schema
  • Use pandas’ rename operations to rename columns
  • If you create a dictionary for the column remap, you can use the values of that dictionary to project to the desired columns (adding source after)
  • Use pandas’ assign to set the source for each dataset
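The rename-project-assign pattern from the hints can be sketched like this for one dataset. The CMOA column names below are assumptions to illustrate the pattern (and `ulan_id` is omitted since CMOA does not have it); verify them against the real file:

```python
import pandas as pd

# Hypothetical CMOA columns; confirm the real names before using this.
cmoa = pd.DataFrame({
    "full_name": ["Mary Cassatt"],
    "cited_name": ["Cassatt, Mary"],
    "birth_date": [1844.0],
    "death_date": [1926.0],
    "nationality": ["American"],
    "extra_column": ["this gets dropped"],
})

# One dictionary drives both the rename and the projection.
remap = {
    "full_name": "name",
    "cited_name": "cited_name",
    "birth_date": "birth_year",
    "death_date": "death_year",
    "nationality": "nationality",
}

cmoa_common = (
    cmoa.rename(columns=remap)[list(remap.values())]  # rename, then project
        .assign(source="cmoa")                        # tag the data source
)
```

Repeating this with a per-dataset `remap` dictionary gives four frames in the common schema.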

4. Entity Resolution

We now wish to identify the same artist in the different datasets. We will use three different approaches, one using known identifiers, another using exact name matches, and the final one using probabilistic record linkage.

4a. Using ULAN Identifiers (10 pts)

Determining which artists are the same across different datasets can be difficult, especially when their names are very similar, or even the same. The Met and the National Gallery of Art (NGA) have both added Getty Union List of Artist Names (ULAN) identifiers to their artist information, and we can use this to tie artists in those two datasets together. This can also serve as a way to match names in the other datasets. The Met dataset will require some string operations to extract just the identifier; for http://vocab.getty.edu/page/ulan/500336597, the ULAN identifier is 500336597. Join items from the two datasets when they have matching ULAN identifiers. Then, output the names that do not exactly match in at least one of the two name versions (name and cited_name); there should be 1031 of the 3325 matched.
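The extraction and join described above can be sketched on toy data in the common schema. The example ULAN URL is from the assignment; the artist rows are made up:

```python
import pandas as pd

# Toy frames in the common schema; the Met stores ULAN ids as full URLs.
met = pd.DataFrame({
    "name": ["Kiyohara Yukinobu"],
    "ulan_id": ["http://vocab.getty.edu/page/ulan/500336597"],
})
nga = pd.DataFrame({
    "name": ["Yukinobu, Kiyohara"],
    "ulan_id": ["500336597"],
})

# Keep only the trailing digits of the URL so the ids are comparable.
met["ulan_id"] = met["ulan_id"].str.extract(r"(\d+)$", expand=False)

# Inner join on the shared identifier.
matched = met.merge(nga, on="ulan_id", suffixes=("_met", "_nga"))

# Rows whose names disagree (here, checking only the forward name).
mismatched = matched[matched["name_met"] != matched["name_nga"]]
```

For the assignment you would check both `name` and `cited_name` when counting mismatches.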

4b. Using exact name matches (10 pts)

Now, do merges based on exact name matches, using either the name or cited_name fields (either one with an exact match is ok).
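One way to implement "either field matches" is to collect index pairs from two separate merges and deduplicate them. A sketch on made-up rows:

```python
import pandas as pd

met = pd.DataFrame({"name": ["Edgar Degas", "Hilaire Degas"],
                    "cited_name": ["Degas, Edgar", "Degas, Hilaire"]})
nga = pd.DataFrame({"name": ["Edgar Degas"],
                    "cited_name": ["Degas, Hilaire"]})

def match_pairs(left, right, on):
    # Record (left index, right index) pairs for exact matches on one column.
    merged = left.reset_index().merge(right.reset_index(), on=on,
                                      suffixes=("_l", "_r"))
    return merged[["index_l", "index_r"]]

# A pair counts as a match if either name version agrees exactly.
pairs = pd.concat([match_pairs(met, nga, "name"),
                   match_pairs(met, nga, "cited_name")]).drop_duplicates()
```

Working with index pairs rather than merged rows makes it easier to combine these matches with those from Parts 4a and 4c later.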

4c. [CS 680 Only] Using fuzzymatcher (15 pts)

Clearly, there is variation in how names are written in the Met and NGA datasets, and we expect this to be the case for the other datasets as well. There are libraries that help with entity resolution; one for Python that uses probabilistic record linkage is fuzzymatcher. You can install it using pip or conda (pip install fuzzymatcher or conda install -c conda-forge fuzzymatcher). Use it to try to determine similar records by matching pairwise on both the “firstName lastName” and “lastName, firstName” formats. This may take some time to run.

Hints:

  • Use the link_table method. It gives both a rank and a score. We want to take the top-ranked match, but may wish to leave some records unmatched if the score is too low.
  • The scores for matches with Tate data may be lower than for the other data because Tate is only being matched on one name.

4d. Integration (10 pts)

Using the matches from the above parts, create a single new dataframe that contains all of the merged information. There are a number of ways of putting all of the pairwise matches together. One option is to do this hierarchically. Another option is to build clusters and make sure those clusters agree with each other. Keep the original rows, but create a new unique artist_id column that indicates a unique artist. Now this unique id will appear in multiple rows for the artists that appear in different datasets. Your matches should be mostly correct, but do not need to be totally correct if you have used the fuzzymatcher.

Hints
  • Matches from 4a should have priority over other matches because they are linked by shared identifiers.
  • Use concat to concatenate multiple datasets
  • If you know that a (<source>, <index>) tuple maps to a new artist_id, you can use apply to map that pair of columns to the new artist id.
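The concat-then-apply pattern from the hints can be sketched as follows, with a hypothetical match mapping standing in for the clusters you build from Parts 4a–4c:

```python
import pandas as pd

# Toy common-schema frames, already tagged with their source.
met = pd.DataFrame({"name": ["Claude Monet"], "source": ["met"]})
nga = pd.DataFrame({"name": ["Monet, Claude"], "source": ["nga"]})

# Concatenate, keeping each row's original per-dataset index as a column.
combined = (pd.concat([met, nga])
              .reset_index()
              .rename(columns={"index": "orig_index"}))

# Hypothetical mapping built from the Part 4 matches:
# (source, original index) -> unique artist id.
match_map = {("met", 0): 0, ("nga", 0): 0}

# Map each (source, index) pair to its artist_id.
combined["artist_id"] = combined.apply(
    lambda row: match_map.get((row["source"], row["orig_index"])), axis=1
)
```

Unmatched rows (pairs missing from `match_map`) come back as None and would need fresh ids of their own.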

5. Data Fusion (20 pts)

At the end of Part 4, you should have a data frame with matching artists identified with the same artist_id. Now, we will put all of the datasets together and come up with a single record for each artist. To do this, we need to decide on a common name, decide on birth/death dates, and coalesce nationality.

Review the aggregate method in pandas to find functions for choosing either the majority value or the median. Note that the Tate data has not been updated for several years, so some artists who have since died are listed in that dataset without a date of death. Null values are excluded from the median calculation automatically, but empty-string nationality values are not, so you will need to handle them yourself.

Examine nationality carefully and add at least two transformations that leverage domain or statistical knowledge about particular nationalities to improve the aggregation. It may be useful to see which values appear in some datasets and not in others. Make sure to create a final dataset where each artist appears only once. Again, it is ok if your data is not perfect, but it should correctly use the linking methods and choose the correct aggregate methods.
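A minimal sketch of the fusion step, using groupby aggregation on toy rows for a single artist. The `majority` helper filters out empty strings before taking the mode, per the caveat above:

```python
import numpy as np
import pandas as pd

# Rows for the same artist from different sources, linked by artist_id.
df = pd.DataFrame({
    "artist_id": [0, 0, 0],
    "name": ["Claude Monet", "Claude Monet", "Monet, Claude"],
    "death_year": [1926.0, 1926.0, np.nan],   # Tate-style missing death year
    "nationality": ["French", "", "French"],
})

def majority(series):
    # Most common non-empty value; empty strings would otherwise distort the vote.
    cleaned = series[series != ""]
    return cleaned.mode().iloc[0] if not cleaned.empty else ""

fused = df.groupby("artist_id").agg(
    name=("name", majority),
    death_year=("death_year", "median"),   # NaN values are ignored automatically
    nationality=("nationality", majority),
)
```

The nationality-specific transformations you add for this part would run before (or inside) the aggregation, e.g. normalizing variant spellings so the majority vote works on consistent values.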

Extra Credit

  • Extend the fuzzymatcher approach to use the artists’ initials to improve the matching. See the fuzzymatcher examples.
  • See if there is any indication that one source is copying from the other.