The goal of this assignment is to use pandas to work on data integration and data fusion.
You should work on the assignment on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command line as `jupyter-lab`.
In this assignment, we will be working with data about world travel collected by various institutions. The goal is to integrate data about different countries in order to understand different relationships, to deal with information that comes from multiple sources, and to fuse datasets together. We will use datasets from the United Nations World Tourism Organization (UNWTO), the World Bank (WB), the OECD, and the United Nations World Population Division (UNPOP). Since we are focusing on data integration and data fusion, some of the initial data transformation steps have been done, and those transformed datasets are available from the following Parquet URLs or CSV URLs:
In the World Bank dataset, `ST.INT.RCPT.CD` are the receipts and `ST.INT.XPND.CD` are the expenditures. In the UNWTO dataset, “Inbound Tourism expenditure in the country” are the receipts and “Outbound Tourism expenditure in other countries” are the expenditures. In the OECD dataset, “Total international expenditure” and “Total international receipts” are the columns for expenditures and receipts, respectively. Other attributes should be mostly decipherable, but please ask if you have any questions about the columns.
The assignment is due at 11:59pm on Wednesday, April 5.
You should submit the completed notebook file named `a4.ipynb` on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
CS 680 students are responsible for all parts; CS 490 students do not need to do probabilistic record linking (Part 4c). Please make sure to follow the instructions to receive full credit. Use a Markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
All of the data can be read via pandas. You may load either the Parquet or CSV versions, but I recommend Parquet. (You may need to install `pyarrow` to use that format.) You may either read the data directly from the URLs above using pandas or download them locally first. Before working on integration, we have some data cleaning to do. First, the OECD dataset has the values as strings and uses `..` to indicate that a value is unknown. Convert that indicator to the `NaN` value (`np.nan`) and convert the column to a float. Second, the WB dataset states the values in US Dollars while the units for the other two datasets are millions of US Dollars; convert so that all datasets have the same units for the values. Finally, all datasets are in a long format but are not tidy: the variables for a given country-year are spread across separate rows instead of being part of a single observation, so transform the variables column so that those variables become separate attributes.
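A rough sketch of these cleaning steps follows. The column names (`value`, `Country Code`, `Series Code`) and file names are assumptions; substitute the actual names from the datasets.

```python
import numpy as np
import pandas as pd

# Load the Parquet versions (requires pyarrow); substitute the actual URLs.
oecd = pd.read_parquet("oecd.parquet")
wb = pd.read_parquet("wb.parquet")
unwto = pd.read_parquet("unwto.parquet")

# OECD: '..' marks unknown values; convert to NaN and cast to float.
oecd["value"] = oecd["value"].replace("..", np.nan).astype(float)

# WB: values are in US Dollars; convert to millions of US Dollars.
wb["value"] = wb["value"] / 1e6

# Untidy long format -> one column per variable for each country-year.
wb_wide = wb.pivot_table(index=["Country Code", "year"],
                         columns="Series Code", values="value").reset_index()
```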
The WB data has missing values for some years. In order to allow more specific comparisons between different countries, we can estimate the missing values based on the surrounding values. However, note that we need to restrict the operation to the same country and indicator (receipts or expenditures). Thus, we need to group by those columns first, and then run imputation. However, we also don’t want to add values outside of the reported range. For example, if Austria reports values for 1995-2005 and 2015-2020, we can impute the missing values from 2005-2015, but if it reports no values before 1995 or after 2020, we do not want to fill in values outside of that range but rather leave them blank. Pandas provides the `apply` method for groups that allows a user-defined function that takes the grouped data frame and returns a modified data frame. (Usually, we try to avoid `apply` calls, but for some operations, it is necessary.)
The `interpolate` method has a parameter that is useful for only imputing values between known values; a sketch of this approach follows.
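A minimal sketch of the grouped imputation, assuming the WB data is still in long format with `country`, `indicator`, `year`, and `value` columns at this point:

```python
def impute_inside(group):
    # Sort by year, then interpolate; limit_area="inside" fills only the
    # gaps between known values and leaves leading/trailing years as NaN.
    group = group.sort_values("year")
    group["value"] = group["value"].interpolate(limit_area="inside")
    return group

wb = wb.groupby(["country", "indicator"], group_keys=False).apply(impute_inside)
```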
For the rest of the assignment, we are going to focus on a few of the attributes that exist across the datasets, and a common schema will make our data fusion clearer. In this schema, we have the following attributes for each country-year entry:
- `countryCodeNum`: the numeric code for the country
- `countryCodeAlpha`: the alphabetic code for the country
- `country`: the name of the country
- `year`: the year the amounts were recorded
- `receipts`: the amount of money spent in the country on travel
- `expenditures`: the amount of money residents spend in other countries while on travel
- `source`: the corresponding data source (`"unwto"`, `"wb"`, `"oecd"`)

Determine which columns in each dataset map to this common schema. Note that some datasets will not have (some of) the codes! Then, transform the three datasets to match this common schema (a sketch follows). Keep the datasets separate (for now), but drop all columns except for the seven above.
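One way to sketch this transformation for the WB dataset; the original WB column names (`Country Name`, `Country Code`) are assumptions, so check the real ones:

```python
wb_common = (
    wb_wide.rename(columns={
        "Country Name": "country",           # assumed original name
        "Country Code": "countryCodeAlpha",  # assumed original name
        "ST.INT.RCPT.CD": "receipts",
        "ST.INT.XPND.CD": "expenditures",
    })
    .assign(source="wb", countryCodeNum=pd.NA)  # WB has no numeric codes
)
wb_common = wb_common[["countryCodeNum", "countryCodeAlpha", "country",
                       "year", "receipts", "expenditures", "source"]]
```

The UNWTO and OECD datasets can be handled the same way, with `source` set to `"unwto"` and `"oecd"` and whichever code column is missing filled with `pd.NA`.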
We now wish to identify the same country in the different datasets. We will standardize everything using the UN Population country names and codes, which include both alphabetic and numeric country code information. Before starting, it will be useful to create a new table (`country_codes`) that has just the country names and codes, deduplicated. We will use three different approaches: one using known codes, another using exact name matches, and the final one using probabilistic record linkage. At the end, each row of the dataset should have the `countryName`, `countryCodeAlpha`, and `countryCodeNum` columns filled. You may drop rows that cannot be resolved (e.g. the regions like Europe in the World Bank data).
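Building the deduplicated lookup might look like this (the UNPOP column names are assumptions):

```python
country_codes = (
    unpop[["country", "countryCodeNum", "countryCodeAlpha"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```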
The World Bank dataset includes alphabetic country codes that can be matched with the country codes. Join this dataset with the country names and both code columns from the population data. The UNWTO dataset contains the numeric country codes that can be matched with the population data. Join this dataset with the country names and both code columns from that population data (a sketch of both joins follows). Note that the World Bank contains data that includes regions, not just countries; these can be excluded. The UNWTO data has a few outdated numeric codes. Resolve these either manually or using a technique similar to Part b or c.
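A sketch of both code-based joins, reusing the hypothetical `wb_common` and `unwto_common` frames from Part 3:

```python
# WB: link on the alphabetic code; an inner join drops unmatched regions.
wb_linked = wb_common.drop(columns=["country", "countryCodeNum"]).merge(
    country_codes, on="countryCodeAlpha", how="inner"
)

# UNWTO: link on the numeric code; rows with outdated codes will not match
# and need to be resolved manually or with the techniques from Parts b/c.
unwto_linked = unwto_common.drop(columns=["country", "countryCodeAlpha"]).merge(
    country_codes, on="countryCodeNum", how="inner"
)
```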
For the OECD dataset, do merges with the UN population data based on exact country name matches. However, note that punctuation and capitalization can cause problems, so before joining, make sure both columns have punctuation removed and all characters in a single case (lower or upper). Keep whitespace between components. A sketch of this normalization follows.
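This sketch assumes the common-schema frames from Part 3 and non-null name columns:

```python
import re

def normalize(name):
    # Drop punctuation, collapse runs of whitespace, and lowercase,
    # keeping the spaces between name components.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name)).lower().strip()

oecd_keyed = (oecd_common.drop(columns=["countryCodeNum", "countryCodeAlpha"])
              .assign(nameKey=oecd_common["country"].map(normalize)))
codes_keyed = country_codes.assign(nameKey=country_codes["country"].map(normalize))
oecd_linked = oecd_keyed.merge(codes_keyed, on="nameKey",
                               how="left", suffixes=("_oecd", ""))
```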
There is some variation in how names are written in the OECD and UN Population datasets, and we expect this to be the case for the other datasets as well. There are libraries that help with entity resolution; one for Python that uses probabilistic record linkage is fuzzymatcher, but you might also use Splink or the Python Record Linkage Toolkit. You can install these using pip or conda/mamba (e.g. `pip install fuzzymatcher` or `conda install -c conda-forge fuzzymatcher`). Use one of them to try to determine similar records by matching pairwise. You may also need to fix a couple of country names manually. The `link_table` method may be useful because it gives both a rank and a score. We usually want to take the top-ranked match, but if it is incorrect, the other matches may be useful in resolving the names.
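A sketch using fuzzymatcher; the frame names are hypothetical, and the `match_rank`/`match_score` column names are assumptions about fuzzymatcher's output, so verify them against your installed version:

```python
import fuzzymatcher

# Compare the still-unmatched OECD names against the UN Population names.
matches = fuzzymatcher.link_table(
    oecd_unmatched, country_codes,
    left_on=["country"], right_on=["country"],
)

# Keep the top-ranked candidate for each record; inspect low scores and
# fall back to lower-ranked candidates (or manual fixes) when it is wrong.
best = matches[matches["match_rank"] == 1]
```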
Using the matches from the above parts, create one new dataframe that contains all of the merged information. After Parts a-c, the code columns should be filled for all three datasets. These codes (along with the year) will serve as the ids that link the same country together from each dataset.
At the end of Part 4, you should have a data frame with matching countries identified with the same country code (either numeric or alphabetic). Now, we will put all of the datasets together and come up with a single record for each country. To do this, we need to decide on a common name and coalesce the receipts and expenditures numbers.
Review the `aggregate` method in pandas to find methods for choosing either the majority’s opinion or the mean or median value. Null values are not included in the mean/median calculation. Create a final dataset where each country-year pair appears only once (a sketch follows). It is ok if your data is not perfect, but it should correctly use the linking methods and choose the correct aggregate methods.
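One possible aggregation sketch, taking the most common name and the median of the numeric values; `combined` stands for the hypothetical stacked frame from Part 4d:

```python
def majority(s):
    # Most common non-null value; None if the group has no values.
    m = s.dropna().mode()
    return m.iat[0] if len(m) else None

fused = (
    combined.groupby(["countryCodeNum", "year"], as_index=False)
    .agg(
        country=("country", majority),
        countryCodeAlpha=("countryCodeAlpha", "first"),
        receipts=("receipts", "median"),        # nulls are skipped
        expenditures=("expenditures", "median"),
    )
)
```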
Join the fused dataset with the full UN population data and calculate per capita values for the receipts and expenditures (the average amount each resident received or spent related to travel). Add the following columns:
- `population`: the population (estimate) for the country in the given year
- `perCapitaRcpts`: the amount of money received in the country on a per capita basis
- `perCapitaExpnds`: the amount of money spent on travel in other countries on a per capita basis

Each of the per capita columns is calculated by dividing the value by the population. Remember that if your values are in millions of US Dollars, you need to take that into consideration during the calculation.
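For instance, if the fused values are in millions of US Dollars and the UNPOP `population` column is in thousands (an assumption consistent with the sample output below, e.g. 315271.591 for the US in 2012), the calculation on a hypothetical `merged` frame looks like:

```python
# dollars / persons: scale millions of USD by 1e6 and thousands by 1e3
merged["perCapitaRcpts"] = (merged["receipts"] * 1e6) / (merged["population"] * 1e3)
merged["perCapitaExpnds"] = (merged["expenditures"] * 1e6) / (merged["population"] * 1e3)
```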
An example snippet of the output is shown below.
|  | country | countryCodeNum | countryCodeAlpha | year | receipts | expenditures | population | perCapitaRcpts | perCapitaExpnds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12846 | United States of America | 840 | USA | 2012 | 195113.0 | 138167.0 | 315271.591 | 618.872761 | 438.247543 |
| 12847 | United States of America | 840 | USA | 2013 | 213105.0 | 132322.0 | 318031.051 | 670.076080 | 416.066292 |
| 12848 | United States of America | 840 | USA | 2014 | 222746.0 | 140228.0 | 320719.281 | 694.520140 | 437.229716 |
| 12849 | United States of America | 840 | USA | 2015 | 230574.0 | 144669.0 | 323348.646 | 713.081693 | 447.408708 |
| 12850 | United States of America | 840 | USA | 2016 | 228549.0 | 147640.0 | 325866.907 | 701.356889 | 453.068406 |
| 12851 | United States of America | 840 | USA | 2017 | 233758.0 | 158287.0 | 328553.489 | 711.476237 | 481.769347 |
| 12852 | United States of America | 840 | USA | 2018 | 241984.0 | 176607.0 | 331028.972 | 731.005502 | 533.509194 |
| 12853 | United States of America | 840 | USA | 2019 | 239256.0 | 185573.0 | 333251.103 | 717.945111 | 556.856371 |
| 12854 | United States of America | 840 | USA | 2020 | 84252.5 | 48183.5 | 335388.238 | 251.208869 | 143.664847 |
| 12855 | United States of America | 840 | USA | 2021 | 82974.0 | 73485.0 | 336495.769 | 246.582595 | 218.383132 |