The goal of this assignment is to use pandas to work on data integration and data fusion.
You should work on the assignment on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command line as `jupyter-lab`.
In this assignment, we will be working with data about world travel collected by various institutions. The goal is to integrate data about different countries in order to understand different relationships, to deal with information that comes from multiple sources, and to fuse datasets together. We will use datasets from the United Nations World Tourism Organization (UNWTO), the World Bank (WB), the OECD, and the United Nations World Population Division (UNPOP). Since we are focusing on data integration and data fusion, some of the initial data transformation steps have been done, and those transformed datasets are available from the following Parquet URLs or CSV URLs:
In the World Bank dataset, `ST.INT.RCPT.CD` are the receipts and `ST.INT.XPND.CD` are the expenditures. In the UNWTO dataset, “Inbound Tourism expenditure in the country” are the receipts and “Outbound Tourism expenditure in other countries” are the expenditures. In the OECD dataset, “Total international expenditure” and “Total international receipts” are the columns for expenditures and receipts, respectively. Other attributes should be mostly decipherable, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Wednesday, April 5.
You should submit the completed notebook file named `a4.ipynb` on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
CS 680 students are responsible for all parts; CS 490 students do not need to do probabilistic record linking (Part 4c). Please make sure to follow the instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
All of the data can be read via pandas. You may load either the Parquet or CSV versions, but I recommend Parquet. (You may need to install `pyarrow` to use that format.) You may either read the data directly from the URLs above using pandas or download them locally first. Before working on integration, we have some data cleaning to do. First, the OECD dataset has the values as strings and uses `..` to indicate that a value is unknown. Convert that indicator to the `NaN` value (`np.nan`) and convert the column to a float. Second, the WB dataset states the values in US Dollars while the units for the other two datasets are millions of US Dollars; convert so that all datasets have the same units for the values. Finally, all datasets are in a long format but are not tidy: the variables for a given country-year are spread across separate rows instead of being part of a single observation, so transform the variables column so that those variables become separate attributes.
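A rough sketch of these cleaning steps follows. The column names (`value`, `Country Code`, `Series Code`) and file names are assumptions; substitute the actual names from the datasets.

```python
import numpy as np
import pandas as pd

# Load the Parquet versions (requires pyarrow); substitute the actual URLs.
oecd = pd.read_parquet("oecd.parquet")
wb = pd.read_parquet("wb.parquet")
unwto = pd.read_parquet("unwto.parquet")

# OECD: '..' marks unknown values; convert to NaN and cast to float.
oecd["value"] = oecd["value"].replace("..", np.nan).astype(float)

# WB: values are in US Dollars; convert to millions of US Dollars.
wb["value"] = wb["value"] / 1e6

# Untidy long format -> one column per variable for each country-year.
wb_wide = wb.pivot_table(index=["Country Code", "year"],
                         columns="Series Code", values="value").reset_index()
```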
The WB data has missing values for some years. In order to allow more specific comparisons between different countries, we can estimate the missing values based on the surrounding values. However, note that we need to restrict the operation to the same country and indicator (receipts or expenditures). Thus, we need to group by those columns first, and then run imputation. However, we also don’t want to add values outside of the reported range. For example, if Austria reports values for 1995-2005 and 2015-2020, we can impute the missing values from 2005-2015, but if it reports no values before 1995 or after 2020, we do not want to fill in values outside of that range but rather leave them blank. Pandas provides the `apply` method for groups that allows a user-defined function that takes the grouped data frame and returns a modified data frame. (Usually, we try to avoid `apply` calls, but for some operations, it is necessary.)
The `interpolate` method has a parameter that is useful for only imputing values between known values; a sketch of this approach follows.
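A minimal sketch of the grouped imputation, assuming the WB data is still in long format with `country`, `indicator`, `year`, and `value` columns at this point:

```python
def impute_inside(group):
    # Sort by year, then interpolate; limit_area="inside" fills only the
    # gaps between known values and leaves leading/trailing years as NaN.
    group = group.sort_values("year")
    group["value"] = group["value"].interpolate(limit_area="inside")
    return group

wb = wb.groupby(["country", "indicator"], group_keys=False).apply(impute_inside)
```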
For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each country-year entry:
- `countryCodeNum`: the numeric code for the country
- `countryCodeAlpha`: the alphabetic code for the country
- `country`: the name of the country
- `year`: the year the amounts were recorded
- `receipts`: the amount of money spent in the country on travel
- `expenditures`: the amount of money residents spend in other countries while on travel
- `source`: the corresponding data source (`"unwto"`, `"wb"`, `"oecd"`)

Determine which columns in each dataset map to this common schema. Note that some datasets will not have (some of) the codes! Then, transform the three datasets to match this common schema (a sketch follows). Keep the datasets separate (for now), but drop all columns except for the seven above.
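One way to sketch this transformation for the WB dataset; the original WB column names (`Country Name`, `Country Code`) are assumptions, so check the real ones:

```python
wb_common = (
    wb_wide.rename(columns={
        "Country Name": "country",           # assumed original name
        "Country Code": "countryCodeAlpha",  # assumed original name
        "ST.INT.RCPT.CD": "receipts",
        "ST.INT.XPND.CD": "expenditures",
    })
    .assign(source="wb", countryCodeNum=pd.NA)  # WB has no numeric codes
)
wb_common = wb_common[["countryCodeNum", "countryCodeAlpha", "country",
                       "year", "receipts", "expenditures", "source"]]
```

The UNWTO and OECD datasets can be handled the same way, with `source` set to `"unwto"` and `"oecd"` and whichever code column is missing filled with `pd.NA`.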
We now wish to identify the same country in the different datasets. We will standardize everything using the UN Population country names and codes, which include both alphabetic and numeric country code information. Before starting, it will be useful to create a new table (`country_codes`) that has just the country names and codes, deduplicated. We will use three different approaches: one using known codes, another using exact name matches, and the final one using probabilistic record linkage. At the end, each row of the dataset should have the `countryName`, `countryCodeAlpha`, and `countryCodeNum` columns filled. You may drop rows that cannot be resolved (e.g. the regions like Europe in the World Bank data).
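Building the deduplicated lookup might look like this (the UNPOP column names are assumptions):

```python
country_codes = (
    unpop[["country", "countryCodeNum", "countryCodeAlpha"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```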
The World Bank dataset includes alphabetic country codes that can be matched with the country codes. Join this dataset with the country names and both code columns from the population data. The UNWTO dataset contains the numeric country codes that can be matched with the population data. Join this dataset with the country names and both code columns from that population data (a sketch of both joins follows). Note that the World Bank contains data that includes regions, not just countries; these can be excluded. The UNWTO data has a few outdated numeric codes. Resolve these either manually or using a technique similar to Part b or c.
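A sketch of both code-based joins, reusing the hypothetical `wb_common` and `unwto_common` frames from Part 3:

```python
# WB: link on the alphabetic code; an inner join drops unmatched regions.
wb_linked = wb_common.drop(columns=["country", "countryCodeNum"]).merge(
    country_codes, on="countryCodeAlpha", how="inner"
)

# UNWTO: link on the numeric code; rows with outdated codes will not match
# and need to be resolved manually or with the techniques from Parts b/c.
unwto_linked = unwto_common.drop(columns=["country", "countryCodeAlpha"]).merge(
    country_codes, on="countryCodeNum", how="inner"
)
```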
For the OECD dataset, do merges with the UN population data based on exact country name matches. However, note that punctuation and capitalization can cause problems, so before joining, make sure both columns have punctuation removed and all characters in a single case (lower or upper). Keep whitespace between components. A sketch of this normalization follows.
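This sketch assumes the common-schema frames from Part 3 and non-null name columns:

```python
import re

def normalize(name):
    # Drop punctuation, collapse runs of whitespace, and lowercase,
    # keeping the spaces between name components.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name)).lower().strip()

oecd_keyed = (oecd_common.drop(columns=["countryCodeNum", "countryCodeAlpha"])
              .assign(nameKey=oecd_common["country"].map(normalize)))
codes_keyed = country_codes.assign(nameKey=country_codes["country"].map(normalize))
oecd_linked = oecd_keyed.merge(codes_keyed, on="nameKey",
                               how="left", suffixes=("_oecd", ""))
```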
There is some variation in how names are written in the OECD and UN Population datasets, and we expect this to be the case for the other datasets as well. There are libraries that help with entity resolution; one for Python that uses probabilistic record linkage is fuzzymatcher, but you might also use Splink or the Python Record Linkage Toolkit. You can install these using pip or conda/mamba (e.g. `pip install fuzzymatcher` or `conda install -c conda-forge fuzzymatcher`). Use one of them to try to determine similar records by matching pairwise. You may also need to fix a couple of country names manually. The `link_table` method may be useful because it gives both a rank and a score. We usually want to take the top-ranked match, but if it is incorrect, the other matches may be useful in resolving the names.
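A sketch using fuzzymatcher; the frame names are hypothetical, and the `match_rank`/`match_score` column names are assumptions about fuzzymatcher's output, so verify them against your installed version:

```python
import fuzzymatcher

# Compare the still-unmatched OECD names against the UN Population names.
matches = fuzzymatcher.link_table(
    oecd_unmatched, country_codes,
    left_on=["country"], right_on=["country"],
)

# Keep the top-ranked candidate for each record; inspect low scores and
# fall back to lower-ranked candidates (or manual fixes) when it is wrong.
best = matches[matches["match_rank"] == 1]
```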
Using the matches from the above parts, create one new dataframe that contains all of the merged information. After Parts a-c, the code columns should be filled for all three datasets. These codes (along with the year) will serve as the ids that link the same country together from each dataset.
At the end of Part 4, you should have a data frame with matching countries identified with the same country code (either numeric or alphabetic). Now, we will put all of the datasets together and come up with a single record for each country. To do this, we need to decide on a common name and coalesce the receipts and expenditures numbers.
Review the `aggregate` method in pandas to find methods for choosing either the majority’s opinion or the mean or median value. Null values are not included in the mean/median calculation. Create a final dataset where each country-year pair appears only once (a sketch follows). It is ok if your data is not perfect, but it should correctly use the linking methods and choose the correct aggregate methods.
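One possible aggregation sketch, taking the most common name and the median of the numeric values; `combined` stands for the hypothetical stacked frame from Part 4d:

```python
def majority(s):
    # Most common non-null value; None if the group has no values.
    m = s.dropna().mode()
    return m.iat[0] if len(m) else None

fused = (
    combined.groupby(["countryCodeNum", "year"], as_index=False)
    .agg(
        country=("country", majority),
        countryCodeAlpha=("countryCodeAlpha", "first"),
        receipts=("receipts", "median"),        # nulls are skipped
        expenditures=("expenditures", "median"),
    )
)
```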
Join the fused dataset with the full UN population data and calculate per capita values for the receipts and expenditures (the average amount each resident received or spent related to travel). Add the following columns:
- `population`: the population (estimate) for the country in the given year
- `perCapitaRcpts`: the amount of money received in the country on a per capita basis
- `perCapitaExpnds`: the amount of money spent on travel in other countries on a per capita basis

Each of the per capita columns is calculated by dividing the value by the population. Remember that if your values are in millions of US Dollars, you need to take that into consideration during the calculation.
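For instance, if the fused values are in millions of US Dollars and the UNPOP `population` column is in thousands (an assumption consistent with the sample output below, e.g. 315271.591 for the US in 2012), the calculation on a hypothetical `merged` frame looks like:

```python
# dollars / persons: scale millions of USD by 1e6 and thousands by 1e3
merged["perCapitaRcpts"] = (merged["receipts"] * 1e6) / (merged["population"] * 1e3)
merged["perCapitaExpnds"] = (merged["expenditures"] * 1e6) / (merged["population"] * 1e3)
```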
An example snippet of the output is shown below.
|  | country | countryCodeNum | countryCodeAlpha | year | receipts | expenditures | population | perCapitaRcpts | perCapitaExpnds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12846 | United States of America | 840 | USA | 2012 | 195113.0 | 138167.0 | 315271.591 | 618.872761 | 438.247543 |
| 12847 | United States of America | 840 | USA | 2013 | 213105.0 | 132322.0 | 318031.051 | 670.076080 | 416.066292 |
| 12848 | United States of America | 840 | USA | 2014 | 222746.0 | 140228.0 | 320719.281 | 694.520140 | 437.229716 |
| 12849 | United States of America | 840 | USA | 2015 | 230574.0 | 144669.0 | 323348.646 | 713.081693 | 447.408708 |
| 12850 | United States of America | 840 | USA | 2016 | 228549.0 | 147640.0 | 325866.907 | 701.356889 | 453.068406 |
| 12851 | United States of America | 840 | USA | 2017 | 233758.0 | 158287.0 | 328553.489 | 711.476237 | 481.769347 |
| 12852 | United States of America | 840 | USA | 2018 | 241984.0 | 176607.0 | 331028.972 | 731.005502 | 533.509194 |
| 12853 | United States of America | 840 | USA | 2019 | 239256.0 | 185573.0 | 333251.103 | 717.945111 | 556.856371 |
| 12854 | United States of America | 840 | USA | 2020 | 84252.5 | 48183.5 | 335388.238 | 251.208869 | 143.664847 |
| 12855 | United States of America | 840 | USA | 2021 | 82974.0 | 73485.0 | 336495.769 | 246.582595 | 218.383132 |