The goal of this assignment is to use polars to work on data integration and data fusion.
You may choose to work on this assignment on a hosted environment
(e.g. Google Colab) or
on your own local installation of Jupyter and Python. You should use
Python 3.12 or higher for your work. To install Python, consider using
uv, miniforge, or Anaconda. Also, install JupyterLab (or Jupyter
Notebook). For this assignment, you will need to make sure the polars
package is installed. If you have trouble loading parquet files, install
pyarrow (conda install pyarrow). You should
also install splink
(conda install splink) for entity resolution. Please refer
to the data integration courselet
for help on some of the techniques used in this assignment.
In this assignment, we will be working with data related to global university rankings. Various organizations try to rank large universities around the world using a variety of different statistics. Some also publish information about the universities themselves, including their location and enrollments. We will be working with four different datasets from different organizations: QS World University Rankings (QS), Times Higher Education World University Rankings (THE), U.S. News & World Report Best Global University Rankings (USNWR), and the Academic Ranking of World Universities (ARWU), also known as the Shanghai Rankings. Each dataset has different attributes about the universities, and we will try to integrate them together. Since we are focusing on data integration and data fusion, some of the initial data transformation steps have already been completed, and these datasets are available from the following URLs:
In addition, we will be using the United Nations Statistics Division’s geoscheme to help with region mapping. You can download the CSV file from that site. Attributes should be decipherable and match provided documentation in most cases, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Friday, November 7.
You should submit the completed notebook file named
a4.ipynb on Blackboard. Please use Markdown
headings to clearly separate your work in the notebook.
CS 640 students are responsible for all parts; CS 490 students do not need to complete Part 4b. Please make sure to follow the instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
All of the data is in Parquet format. You can read it in via polars. You may also read the files directly from the URLs above using polars or download and use them locally. Before working on fusing these four datasets, we unsurprisingly have some data cleaning to do.
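For instance, reading one dataset directly from its URL might look like this (the URL shown is a placeholder; substitute the actual links above):

```python
import polars as pl

# Hypothetical URL; substitute one of the actual links listed above
qs = pl.read_parquet("https://example.com/qs.parquet")
```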
The enrollment numbers in the Times Higher Ed (THE) dataset are provided as strings. First, convert the
stats_num_students column to integer values. Note that
these strings have commas in them. Second, the international enrollment
(stats_pc_intl_students) is reported as a
percentage (with % symbol). Convert this to a count
based on the total enrollment number. Create a new column
(e.g. stats_num_intl_students) to store this.
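A minimal polars sketch of both conversions, assuming the THE data has already been loaded into a frame named the:

```python
import polars as pl

the = the.with_columns(
    # e.g. "25,482" -> 25482: strip the commas, then cast to integer
    pl.col("stats_num_students").str.replace_all(",", "").cast(pl.Int64)
).with_columns(
    # e.g. "25%" -> 0.25 * total enrollment, stored as an integer count
    (
        pl.col("stats_pc_intl_students").str.strip_chars("%").cast(pl.Float64)
        / 100
        * pl.col("stats_num_students")
    )
    .round(0)
    .cast(pl.Int64)
    .alias("stats_num_intl_students")
)
```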
The Times rankings (the) do not list a region (e.g., Europe, North
America), but the other three datasets do. Using the location column in
the Times dataset, create a new column subregion that maps each country to its corresponding subregion. The UN geoscheme should help
with this, but note that some country names may not match exactly. You
may need to do some additional string operations or manual mapping to
get this correct. In addition, the UN does not assign regions for some
areas due to political decisions. You may need to use an external
resource (e.g. a search engine) to clear these up. Try to
minimize the number of manual mappings that you have to
do. You shouldn’t need to resort to manual matches for more than 11
locations.
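One way to sketch this, assuming the UNSD CSV uses the column labels "Country or Area" and "Sub-region Name" (verify these against your download; the file name, delimiter, and manual mapping below are all illustrative):

```python
import polars as pl

# File name is hypothetical; the UNSD export is often semicolon-delimited
un = pl.read_csv("unsd_geoscheme.csv", separator=";")

the = the.join(
    un.select(
        pl.col("Country or Area").alias("location"),
        pl.col("Sub-region Name").alias("subregion"),
    ),
    on="location",
    how="left",
)

# See which locations failed to match before fixing them by hand
print(the.filter(pl.col("subregion").is_null())["location"].unique().to_list())

# Illustrative manual fix; extend this after inspecting the unmatched names
manual = {"Taiwan": "Eastern Asia"}
the = the.with_columns(
    pl.coalesce(
        pl.col("subregion"),
        pl.col("location").replace_strict(manual, default=None, return_dtype=pl.String),
    ).alias("subregion")
)
```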
All datasets now contain information about the region of the world where the university is located. However, the naming for regions is not consistent across datasets. For each of the QS, USNWR, and ARWU datasets, create a new column that standardizes the region names. Examine the domains of this column in each dataset and determine a reasonable way to do this. Keep the old column, but treat it as a "subregion". Note that "Australia/New Zealand" should be renamed "Oceania" and "Africas" should be renamed "Africa". The benefit of keeping both is that if two datasets both specialize "Asia/Oceania" into finer subregions, we can block on the subregion, but fall back on the larger region if either dataset does not provide this information. Note that the ARWU dataset will have no non-null values in the subregion column since it is the most general.
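A sketch of the inspection and standardization, shown for QS (the region column name is an assumption, and the mapping contains only the two renames noted above; build out the rest from the domains you observe):

```python
import polars as pl

# Assumes each dataset's original region column is named "region"; adjust as needed
for df in (qs, usnwr, arwu):
    print(df["region"].value_counts())

# Illustrative mapping with only the two renames noted above; extend it until
# all three datasets share one standardized set of region names
region_map = {
    "Australia/New Zealand": "Oceania",
    "Africas": "Africa",
}

# Keep the original values as the subregion; standardized names become the region.
# Repeat for usnwr and arwu; for ARWU, the originals are already the general
# regions, so its subregion column should end up null.
qs = qs.rename({"region": "subregion"}).with_columns(
    pl.col("subregion").replace(region_map).alias("region")
)
```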
The value_counts method, used in the sketch above, will be helpful in assessing the region names.

For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each university:
- name: the university's name
- alt_name: any alternative name for the university (if available)
- country: the country where the university is located
- subregion: the subregion where the university is located (e.g., North America, Asia)
- region: the region where the university is located (e.g., Americas, Asia/Oceania)
- enrollment: the total number of enrolled students
- intl_enroll: the percentage of enrolled students who are from other countries (if available)
- ranking: the university's ranking in that dataset
- source_dataset: the corresponding data source ("qs", "the", "usnwr", "arwu")

Determine which columns in each dataset map to this common schema. Then, transform the four datasets to match this common schema. Keep the datasets separate (for now), but you can drop all columns except for those listed above. You may keep other columns (e.g., city or address) if you want to try to use those in entity resolution.
In order to fuse these datasets together, we need to identify which rankings correspond to the same university. We can use two different approaches to do this. First, we can use an exact match on university name, and second, we can use modern record linkage techniques using splink. We can use a common identifier to link the same rows together. You may choose to start with all of the datasets in individual tables or merge them into one table, but by the end, they should be merged into one table.
Find common items across all datasets based on exact
name matches, using either the name or
alt_name attributes (either one with an exact match is ok).
Set a common_id field so that any matches share the same
common_id value. You may choose to do this in an aggregate
or pairwise way to grow the clusters.
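One possible aggregate approach, assuming the four common-schema frames (a single grouping pass is shown; a repeated pass or a union-find may be needed to fully grow clusters that chain through alternative names):

```python
import polars as pl

# Stack the four common-schema frames and give every row a global row id
all_unis = pl.concat(
    [qs_c, the_c, usnwr_c, arwu_c], how="diagonal"
).with_row_index("uid")

# One row per (uid, name variant): the primary name plus alt_name when present
variants = pl.concat([
    all_unis.select("uid", pl.col("name").alias("variant")),
    all_unis.select("uid", pl.col("alt_name").alias("variant")),
]).drop_nulls("variant")

# Rows sharing any exact name variant adopt the smallest uid in the group;
# chains like A~B (via name) and B~C (via alt_name) may need another pass
with_groups = variants.with_columns(
    pl.col("uid").min().over("variant").alias("common_id")
)
common = with_groups.group_by("uid").agg(pl.col("common_id").min())
all_unis = all_unis.join(common, on="uid", how="left")
```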
Clearly, there is some variation in how university names are written
in the different datasets. We would like to identify the same university
across the datasets by using modern record linkage techniques. There are
libraries that help with entity resolution as well; one that has many
options is Splink. Use
this toolkit to determine matches between the datasets. You can use the
region and subregion as blocking fields. (There is a way to do this so
that region is only used as a fallback when subregion is null on either
side.) Splink supports either linking or deduplication; either should work, but if you use deduplication, make sure to also block on the source_dataset field so that you are only linking across datasets. In
order to compare different universities, you can use a string comparison
on the name and country fields, as well as
numeric comparisons on the enrollment and
intl_enroll fields. You may need to experiment with
different settings to get good results. Once you have identified pairs
of records that match, you should cluster them to find the row for each
dataset for each university. Then, assign this cluster number as a
common_id field.
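Below is a minimal sketch using splink's v4 API. It assumes the four common-schema frames (here qs_c, the_c, usnwr_c, arwu_c) each carry a unique_id column, and the comparison choices, thresholds, and training steps are illustrative starting points rather than tuned settings:

```python
import polars as pl
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

settings = SettingsCreator(
    link_type="link_only",  # with "dedupe_only", also block on source_dataset
    comparisons=[
        cl.JaroWinklerAtThresholds("name", [0.9, 0.7]),
        cl.ExactMatch("country"),
        # splink also offers numeric comparison levels you can experiment
        # with for the enrollment and intl_enroll fields
    ],
    blocking_rules_to_generate_predictions=[
        block_on("subregion"),
        # fallback: compare within the region when either side lacks a subregion
        "(l.subregion IS NULL OR r.subregion IS NULL) AND l.region = r.region",
    ],
)

# splink's DuckDB backend works with pandas frames, so convert from polars;
# each frame is assumed to have a unique_id column (splink's default id name)
frames = [df.to_pandas() for df in (qs_c, the_c, usnwr_c, arwu_c)]
linker = Linker(frames, settings, db_api=DuckDBAPI())

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("country"))

pairs = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairs, threshold_match_probability=0.9
)
matches = pl.from_pandas(clusters.as_pandas_dataframe())  # one cluster_id per input row
```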
You may experiment with different clustering algorithms here, but you
are not required to use a true clustering algorithm. You can instead
just look for mutually agreed best pairings. In other words, if unique_id_l \(\ell\) from source \(s_\ell\) has its highest match probability with unique_id_r \(r\) from source \(s_r\), and unique_id_r \(r\) from source \(s_r\) has its highest match probability with unique_id_l \(\ell\) from source \(s_\ell\), they agree. We
can cluster these pairings together. You can check that you are getting
non-trivial results by looking at the names of the universities that do
not exactly match.
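A sketch of this mutual-best-pairing idea, starting from the pairwise predictions computed above (column names follow splink's predict output; source_dataset_l/_r appear in link jobs with multiple input frames):

```python
import polars as pl

# Pairwise predictions from splink as a polars frame; expected columns are
# unique_id_l/_r, source_dataset_l/_r, and match_probability
pairs_df = pl.from_pandas(pairs.as_pandas_dataframe())

# Each left record's best match per right-hand source, and vice versa
best_l = pairs_df.filter(
    pl.col("match_probability")
    == pl.col("match_probability").max().over("unique_id_l", "source_dataset_r")
)
best_r = pairs_df.filter(
    pl.col("match_probability")
    == pl.col("match_probability").max().over("unique_id_r", "source_dataset_l")
)

# Keep only the pairs where both sides nominate each other
mutual = best_l.join(
    best_r.select("unique_id_l", "unique_id_r"),
    on=["unique_id_l", "unique_id_r"],
)

# Merge the agreed pairs into clusters with a tiny union-find, then use each
# cluster's representative as the common_id
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for l, r in mutual.select("unique_id_l", "unique_id_r").iter_rows():
    parent[find(l)] = find(r)

common_id = {uid: find(uid) for uid in list(parent)}
```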
From the end of the previous section of the assignment, you have a
data frame with matching universities identified with the same
common_id. Now, we will put all of the matching rows
together and come up with a single record for each
university. To do this, we need to decide on a common name, decide on
enrollment numbers, and coalesce the ranking.
Review the aggregation operations in polars to find methods for choosing either the majority's opinion (e.g., for the name) or the average value (for the enrollment fields). Note that polars
also has a mode
operation, but it is not listed in the aggregation section and returns a
list. Remember that if a dataset has no information on
a particular value (e.g. enrollment), we will have to use information
from other datasets.
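For example, a sketch of one fusion pass, assuming all rows live in a single frame named merged with a common_id column (since mode returns a list, take one element; ties break arbitrarily):

```python
import polars as pl

fused = merged.group_by("common_id").agg(
    # majority opinion for the name; mode returns a list, so take one winner
    pl.col("name").drop_nulls().mode().first().alias("name"),
    # average the enrollment figures that are available across sources;
    # mean() skips nulls, so datasets without the value are simply ignored
    pl.col("enrollment").mean().alias("enrollment"),
    pl.col("intl_enroll").mean().alias("intl_enroll"),
)
```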
Determining a common ranking is tricky because each dataset has a
different ranking system and ranks a different set of universities. We
will have cases where one dataset ranks University A above University B
and another dataset ranks University B above University A. To solve
this, we will use the Plackett-Luce
model to infer a common ranking from the individual rankings. We
will use the choix library to do
this. Take each dataset's ranking as an ordering of common_ids and then feed these orderings into the model to obtain a fused ranking. You may need to experiment with different settings to get a good result. The opt_rankings method seems to work best. To make the problem more tractable, only include universities that receive a rank below 250 in at least one of the lists; otherwise, the optimization will take too long.
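A sketch using choix, assuming rankings_by_source is a list with one ranking per dataset, each ranking being the common_ids ordered from best to worst after the 250 cutoff:

```python
import choix
import numpy as np

# Map common_ids to contiguous indices 0..n-1 as choix expects
ids = sorted({cid for ranking in rankings_by_source for cid in ranking})
index = {cid: i for i, cid in enumerate(ids)}

# choix expects each ranking as item indices listed from winner to loser
data = [[index[cid] for cid in ranking] for ranking in rankings_by_source]

# Fit Plackett-Luce strengths; a larger parameter means a stronger university
params = choix.opt_rankings(len(ids), data)
fused_order = [ids[i] for i in np.argsort(-params)]
```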