The goal of this assignment is to use pandas to work on data integration and data fusion.
You may choose to work on the second part of the assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work (although Colab’s 3.6 should work). To use cloud resources, create/login to a free Google account for Colab. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab
.
In this assignment, we will be working with data related to the education around the world. The goal is to integrate data about different countries in order to understand different relationships and deal with information that comes from multiple sources, and to try strategies to fuse datasets together. We will use a subset of the UNESCO data, a subset of data from the World Bank, and a subset of data from the OECD. The data subsets are available here:
Attributes should be decipherable, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Monday, March 22.
You should submit the completed notebook file named a4.ipynb
on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
CS 680 students are responsible for all parts; CS 490 students do not need to complete Part 2. Please make sure to follow instructions to receive full credit. Use a markdown cell to Label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
The World Bank subset is not in a tidy format. To fix it, melt the year columns into a new Year column and Value column to match the other datasets. Keep the rest of the columns around except for the final, unnamed column. Your transformed dataset should have 566,280 rows. Convert the new Year column to the int
type.
columns
property.pd.melt
has keyword arguments to help you name the output columns (Year, Value)In the World Bank subset, we have missing values for some years. In order to allow more specific comparisons between different countries, we can fill in the missing values based on bracketing values. However, note that we need to restrict the operation to the same country and indicator. Thus, we need to group by those columns first, and then run imputation. However, we also don’t want to add values outside of the reported range. For example, if El Salvador reports values for 1983 and 1986 but not for 1984 and 1985, we can impute those missing values, but if it reports no values before 1983 or after 2010, we do not want to interpolate fill in values outside of that range but rather leave them blank. Pandas provides the apply
method for groups that allows a user-defined function that takes the grouped data frame and returns a modified data frame.
interpolate
command can help with interpolation.loc
is inclusive.Now, load the UNESCO subset. The UNESCO dataset has information on teachers that can be integrated with the World Bank dataset. Note that because we have tidy data already, our goal is to integrate the datasets so that columns of teachers and enrollment can be compared in each row. Specifically, we will compute the ratio of students to teachers in primary schools.
Filter the UNESCO dataset for the three indicators related to teachers in primary schools. Pivot those three indicators to columns. Filter the World Bank dataset for indicators related to enrollment in primary schools. Pivot those two indicators to columns as well.
Merge the two datasets by country and year. Think carefully about which column to use for country.
Compute a new column that stores the student to teacher ratio (number of students per teacher) for each country-year (e.g. “Spain, 2014”) in the merged data. Find the top and bottom five country-years with respect to the computed ratio.
All three datasets (UNESCO, World Bank, OECD) have enrollment values that overlap for some of the years. We would like to compare where datasets agree and disagree and examine those values that are quite different. Note that the OECD has only 37 members so it significantly limits the amount of data that can be examined here among all three datasets. Note that you may choose different strategies in terms of renaming columns or how to merge datasets. For the first part, having the different dataset values in different columns may work best, but for the final fusion, having a single column works better. We will only be working with one indicator here (primary enrollment for all genders).
Extract the dataset values for total enrollment in primary education (total, all genders). Merge the rows of all three datasets based on the year and three-character country code for the country.
Compare the names of the countries among the three datasets. For which countries do the datasets disagree on naming? List all of them.
For those values where we have all values, determine how many values each pair of datasets agrees on, disregarding missing values. Output these rows. For which country-year pairs is this absolute difference significant (> 1000)? There should be three.
Now, let’s create a fused dataset that provides the best estimates for all country-year pairs for primary school enrollment. Make sure all the columns are renamed uniformly ('CountryCode','Country','Year','PrimaryStudents'
). Concatenate all three renamed datasets, and use the median value of the (up to three) values for the specified country and year as the final value for the dataset.