CSCI 680/490 - Assignment 4

Goals

The goal of this assignment is to use pandas to work on data integration and data fusion.

Instructions

You may choose to work on the second part of the assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work (although Colab’s 3.6 should work). To use cloud resources, create/login to a free Google account for Colab. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will be working with data related to the education around the world. The goal is to integrate data about different countries in order to understand different relationships and deal with information that comes from multiple sources, and to try strategies to fuse datasets together. We will use a subset of the UNESCO data, a subset of data from the World Bank, and a subset of data from the OECD. The data subsets are available here:

Attributes should be decipherable, but please ask if you have any questions about the columns.

Due Date

The assignment is due at 11:59pm on Monday, March 22.

Submission

You should submit the completed notebook file named a4.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

CS 680 students are responsible for all parts; CS 490 students do not need to complete Part 2. Please make sure to follow instructions to receive full credit. Use a markdown cell to Label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.

1. Data Transformation (15 pts)

The World Bank subset is not in a tidy format. To fix it, melt the year columns into a new Year column and Value column to match the other datasets. Keep the rest of the columns around except for the final, unnamed column. Your transformed dataset should have 566,280 rows. Convert the new Year column to the int type.

Hints:

You can interrogate a data frame for the list of columns using the columns property.
pd.melt has keyword arguments to help you name the output columns (Year, Value)

2. [CS680 Students Only] Imputation (20 pts)

In the World Bank subset, we have missing values for some years. In order to allow more specific comparisons between different countries, we can fill in the missing values based on bracketing values. However, note that we need to restrict the operation to the same country and indicator. Thus, we need to group by those columns first, and then run imputation. However, we also don’t want to add values outside of the reported range. For example, if El Salvador reports values for 1983 and 1986 but not for 1984 and 1985, we can impute those missing values, but if it reports no values before 1983 or after 2010, we do not want to interpolate fill in values outside of that range but rather leave them blank. Pandas provides the apply method for groups that allows a user-defined function that takes the grouped data frame and returns a modified data frame.

Hints:

Make sure to sort the data by year.
Pandas interpolate command can help with interpolation
Pandas has first_valid_index (last_valid_index) to determine the index of the first (last) non-missing value.
Remember pandas .loc is inclusive.

3. Data Integration (30 pts)

Now, load the UNESCO subset. The UNESCO dataset has information on teachers that can be integrated with the World Bank dataset. Note that because we have tidy data already, our goal is to integrate the datasets so that columns of teachers and enrollment can be compared in each row. Specifically, we will compute the ratio of students to teachers in primary schools.

a. Filter and Pivot (10 pts)

Filter the UNESCO dataset for the three indicators related to teachers in primary schools. Pivot those three indicators to columns. Filter the World Bank dataset for indicators related to enrollment in primary schools. Pivot those two indicators to columns as well.

b. Merge (10 pts)

Merge the two datasets by country and year. Think carefully about which column to use for country.

c. Compute Ratio (10 pts)

Compute a new column that stores the student to teacher ratio (number of students per teacher) for each country-year (e.g. “Spain, 2014”) in the merged data. Find the top and bottom five country-years with respect to the computed ratio.

4. Data Fusion (45 pts)

All three datasets (UNESCO, World Bank, OECD) have enrollment values that overlap for some of the years. We would like to compare where datasets agree and disagree and examine those values that are quite different. Note that the OECD has only 37 members so it significantly limits the amount of data that can be examined here among all three datasets. Note that you may choose different strategies in terms of renaming columns or how to merge datasets. For the first part, having the different dataset values in different columns may work best, but for the final fusion, having a single column works better. We will only be working with one indicator here (primary enrollment for all genders).

a. Extract and Combine (15 pts)

Extract the dataset values for total enrollment in primary education (total, all genders). Merge the rows of all three datasets based on the year and three-character country code for the country.

Hints

You will need to do two merges to merge three datasets.
Because you may have columns with similar names (especially if you renamed uniformly after extraction), Pandas’ merge allows you to specify the suffixes so these could be tied to the dataset names to be more meaningful.

b. Country Names Comparison (10 pts)

Compare the names of the countries among the three datasets. For which countries do the datasets disagree on naming? List all of them.

c. Primary Enrollment Comparison (10 pts)

For those values where we have all values, determine how many values each pair of datasets agrees on, disregarding missing values. Output these rows. For which country-year pairs is this absolute difference significant (> 1000)? There should be three.

Hints

Be careful to check absolute difference: -1001 and 1001 are both significant.

d. Fused Dataset (10 pts)

Now, let’s create a fused dataset that provides the best estimates for all country-year pairs for primary school enrollment. Make sure all the columns are renamed uniformly ('CountryCode','Country','Year','PrimaryStudents'). Concatenate all three renamed datasets, and use the median value of the (up to three) values for the specified country and year as the final value for the dataset.

Hints

Pandas’ concat method will be useful in combining all the datasets once they are renamed

Extra Credit

CS 490 students may do Part 2 for extra credit.
Create reusable functions that will fuse other indicators given proper information.
See if there is any indication that one source is copying from the other.