Assignment 4

Goals

The goal of this assignment is to use polars to work on data integration and data fusion.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.12 or higher for your work. To install Python, consider using uv, miniforge, or Anaconda. Also, install JupyterLab (or Jupyter Notebook). For this assignment, you will need to make sure the polars package is installed. If you have trouble loading parquet files, install pyarrow (conda install pyarrow). You should also install splink (conda install splink) for entity resolution. Please refer to the data integration courselet for help on some of the techniques used in this assignment.

In this assignment, we will be working with data related to global university rankings. Various organizations try to rank large universities around the world using a variety of different statistics. Some also publish information about the universities themselves including their location and enrollments. We will be working with four different datasets from different organizations: QS World University Rankings (QS), Times Higher Education World University Rankings (THE), U.S. News & World Report Best Global University Rankings (USNWR), and the Academic Ranking of World Universities (ARWU), also known as the Shanghai Rankings. Each dataset has different attributes about the universities, and we will try to integrate them together. Since we are focusing on data integration and data fusion, some of the initial data transformation steps have already been completed, and these datasets are available from the following URLs:

In addition, we will be using the United Nations Statistics Division’s geoscheme to help with region mapping. You can download the CSV file from that site. Attributes should be decipherable and match provided documentation in most cases, but please ask if you have any questions about the columns.

Due Date

The assignment is due at 11:59pm on Friday, November 7.

Submission

You should submit the completed notebook file named a4.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

CSCI 640 students are responsible for all parts; CSCI 490 students do not need to complete Part 4b. Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.

1. Reading, Cleaning, and Integrating Data

All of the data is in Parquet format. You can read it in via polars. You may also read the files directly from the URLs above using polars or download and use them locally. Before working on fusing these four datasets, we unsurprisingly have some data cleaning to do.

1a. Fix Enrollment Numbers (THE) (15 pts)

The enrollment numbers in the Times Higher Ed (THE) dataset are stored as strings. First, convert the stats_num_students column to integer values. Note that these strings have commas in them. Second, the international enrollment (stats_pc_intl_students) is reported as a percentage (with a % symbol). Convert this to a count based on the total enrollment number. Create a new column (e.g. stats_num_intl_students) to store this.

1b. Data Integration (THE) (15 pts)

The Times rankings (the) do not list a region (e.g., Europe, North America), but the other three datasets do. Using the location column in the Times dataset, create a new column subregion that maps each country to its corresponding UN subregion. The UN geoscheme should help with this, but note that some country names may not match exactly. You may need to do some additional string operations or manual mapping to get this correct. In addition, the UN does not assign regions for some areas due to political considerations. You may need to use an external resource (e.g., a search engine) to clear these up. Try to minimize the number of manual mappings you have to do; you shouldn't need to resort to manual matches for more than 11 locations.

Hints
  • Many of the issues can be cleared up by using a `join_where` operation.
  • Look where country names overlap, but be careful about some matches (e.g. ‘Niger’ and ‘Nigeria’).

1c. Standardize Region Naming (15 pts)

All datasets now contain information about the region of the world where the university is located. However, the naming for regions is not consistent across datasets. For the other three datasets (QS, USNWR, ARWU), create a new column that standardizes the region names. Examine the domains for this column for each dataset and determine a reasonable way to do this. Keep the old column but treat it as a “subregion”. Note that “Australia/New Zealand” should be renamed “Oceania” and “Africas” should be renamed “Africa”. The benefit of keeping both is that when two datasets both specialize “Asia/Oceania” into finer subregions, we can block on the subregion, but fall back on the larger region if either dataset does not provide this information. Note that the ARWU dataset will have no non-null values in the subregion column since it is the most general.

Hints
  • The value_counts method will be helpful in assessing the region names.

2. Schema Matching (15 pts)

For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each university:

  • name: the university’s name
  • alt_name: any alternative name for the university (if available)
  • country: the country where the university is located
  • subregion: the subregion where the university is located (e.g. North America, Asia)
  • region: the region where the university is located (e.g., Americas, Asia/Oceania)
  • enrollment: the total number of enrolled students
  • intl_enroll: the percentage of enrolled students who are from other countries (if available)
  • ranking: the university’s ranking in that dataset
  • source_dataset: the corresponding data source ("qs", "the", "usnwr", "arwu")

Determine which columns in each dataset map to this common schema. Then, transform the four datasets to match this common schema. Keep the datasets separate (for now), but you can drop all columns except for those listed above. You may keep other columns (e.g. city or address if you want to try to use those in entity resolution).

Hints
  • Be careful about identifying how columns map to the common schema
  • Use polars’ rename operations to rename columns
  • You can create a column with a literal value using pl.lit.

3. Entity Resolution

In order to fuse these datasets together, we need to identify which rankings correspond to the same university. We will use two different approaches: first, an exact match on university name, and second, modern record linkage techniques using splink. In both cases, we will assign a common identifier to link matching rows together. You may choose to start with all of the datasets in individual tables or merge them into one table, but by the end, they should be merged into one table.

3a. Using exact name matches (10 pts)

Find common items across all datasets based on exact name matches, using either the name or alt_name attributes (either one with an exact match is ok). Set a common_id field so that any matches share the same common_id value. You may choose to do this in an aggregate or pairwise way to grow the clusters.

3b. Using probabilistic record linkage (25 pts)

Clearly, there is some variation in how university names are written in the different datasets. We would like to identify the same university across the datasets by using modern record linkage techniques. There are libraries that help with entity resolution as well; one that has many options is Splink. Use this toolkit to determine matches between the datasets. You can use the region and subregion as blocking fields. (There is a way to do this so that region is only used as a fallback when subregion is null on either side.) Splink allows either linking or deduplication; either should work, but if you use deduplication, you need to make sure to also block on the source_dataset field so that you are only linking across datasets. In order to compare different universities, you can use a string comparison on the name and country fields, as well as numeric comparisons on the enrollment and intl_enroll fields. You may need to experiment with different settings to get good results. Once you have identified pairs of records that match, you should cluster them to find the row for each dataset for each university. Then, assign this cluster number as a common_id field.

You may experiment with different clustering algorithms here, but you are not required to use a true clustering algorithm. You can instead just look for the agreed-upon maximum pairings. In other words, if record \(\ell\) (unique_id_l, from source \(s_\ell\)) has its highest match score with record \(r\) (unique_id_r) among all records from source \(s_r\), and \(r\) in turn has its highest match score with \(\ell\) among all records from source \(s_\ell\), the two agree. We can cluster these mutual pairings together. You can check that you are getting non-trivial results by looking at the names of the universities that do not match exactly.

Hints
  • splink requires a number of steps to set up the pairwise match scores. Follow the examples to get some ideas about how to do this.
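The mutual-best-pairing idea above can be sketched in plain Python. This simplified version works on pairwise scores between two sides and ignores the per-source grouping (a full solution would find the best partner per source); the input format is an assumption, not splink's output shape:

```python
def mutual_best_pairs(scores):
    """Keep (l, r) pairs where each record is the other's best-scoring partner.

    scores: dict mapping (left_id, right_id) -> match probability.
    """
    best_l, best_r = {}, {}
    for (l, r), p in scores.items():
        # track the highest-scoring partner seen so far on each side
        if l not in best_l or p > scores[(l, best_l[l])]:
            best_l[l] = r
        if r not in best_r or p > scores[(best_r[r], r)]:
            best_r[r] = l
    # keep only the pairs where both sides agree
    return sorted((l, r) for l, r in best_l.items() if best_r.get(r) == l)
```

For example, if "a2" prefers "b2" but "b2" prefers "a3", the (a2, b2) pair is dropped and only mutual pairs survive; a union-find pass over the surviving pairs then yields the common_id clusters.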

4. Data Fusion

From the end of the previous section of the assignment, you have a data frame with matching universities identified with the same common_id. Now, we will put all of the matching rows together and come up with a single record for each university. To do this, we need to decide on a common name, decide on enrollment numbers, and coalesce the ranking.

4a. Coalesce Name and Enrollment Information (15 pts)

Review the aggregation operations in polars to find methods for choosing either the majority value (e.g., for the name) or the average value (for the enrollment fields). Note that polars also has a mode operation, but it is not listed in the aggregation section and returns a list. Remember that if a dataset has no information on a particular value (e.g. enrollment), we will have to use information from other datasets.

4b. [CSCI 640 Only] Determine Fused Ranking (25 pts)

Determining a common ranking is tricky because each dataset has a different ranking system and ranks a different set of universities. We will have cases where one dataset ranks University A above University B and another dataset ranks University B above University A. To solve this, we will use the Plackett-Luce model to infer a common ranking from the individual rankings. We will use the choix library to do this. Take each dataset's ranking as the order of common_ids and then feed this into the model to obtain a fused ranking. You may need to experiment with different settings to get a good result. The opt_rankings method seems to work best. To make the problem more tractable, only include universities that receive a rank below 250 in at least one of the lists. If you don't, the optimization will take too long.

Extra Credit

  • CSCI 490 students may complete 4b for extra credit
  • All students may add subregions to the ARWU dataset using the UNSD geoscheme data
  • All students may incorporate more information into the entity resolution process