The goal of this assignment is to learn how to use pandas, DuckDB, and Ibis for data analysis tasks.
You may choose to work on this assignment on a hosted environment
(e.g. Google Colab) or
on your own local installation of Jupyter and Python. You should use
Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest
way to install and manage Python. If you work locally, you may launch
Jupyter Lab either from the Navigator application or via the
command-line as jupyter-lab
. For this assignment, you will
need to make sure the pandas, python-duckdb, and ibis-framework packages
are installed. From conda, this can be accomplished with
conda install -c conda-forge pandas python-duckdb ibis-framework
.
In this assignment, we will again analyze the Ask
a Manager Salary Survey which is an open dataset anyone can add to
by filling out the associated form. I have downloaded the data
as a comma-separated values (csv) file and retitled the fields,
available here.
Note: All frameworks can read the .csv.gz
format directly (no need to decompress). The attributes of the data
are:
Timestamp
: the date the survey was completedAge
: age of participantIndustry
: field of workJobTitle
JobDetails
: extra information about the jobSalary
: annual salaryExtraComp
: extra compensationCurrency
: the currency of the salary (e.g. USD)CurrencyOther
: a currency not listed for the previous
fieldIncomeDetails
: extra information about incomeCountry
: the country the participant works inState
: if in the U.S., the state(s) the participant
works inCity
: the city the participant works inExpOverall
: overall years workingExpInField
: years working in current fieldEducation
: amount of educationGender
Race
You will do the same analysis as in Assignment 1, but now using pandas, DuckDB, and Ibis. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.
The assignment is due at 11:59pm on Friday, February 17.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a2.ipynb
.
CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework.
Read and parse the data file and store it in a dataframe. Find the job title and salary associated with the entry that has the most states selected.
read_csv
method that will also read .csv.gz
files.str.
prefix to
the method.idxmax
method that returns not only the count but the index of the row with the
maximum value.Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state so the answer is more than 1208! Here, you will need to convert the list of strings into new rows before counting.
explode
methodCompute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
query
astype
Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
to_datetime
later.assign
can add or replace a column; it returns a new version of the
dataframe.The dataset allows participants to enter the country where they work
(Country
) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S. Examining the Country
field,
we see there are many ways of writing this (e.g. “United States”, “USA”,
“U.S.”). List the top 10 ways participants identify the U.S.
DuckDB has a couple of APIs for interacting with databases: the basic
API (docs
which complies with the Python DB-API 2.0
and a relational API (docs).
Please use the DB-API for this assignment. In addition, DuckDB supports
transferring data to and from pandas dataframes. Again, for this
assignment, all computation must be done via SQL, but you may display
results as pandas dataframes. To connect to a new
in-memory database, use duckdb.connect()
. To execute a SQL
statement, use duckdb.execute
, and to retrieve any results,
use .df()
(or
fetchall()
/fetchone()
) after the execute call.
You should be able to write the solutions to all parts using one or two
SQL statements.
Read in the data file to a new table. Find the job title and salary associated with the entry that has the most states selected.
SELECT *, <EXPR>
.arg_max
function that will return a specified column’s value for the maximum
over a different columnCount the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208. In addition to splitting the string as in part a, you will need to unnest the column.
Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
TRY_CAST
variant.Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
strptime
with the correct format string to convert the values.SELECT *, <EXPR>
.2022-01-01 00:00:00
)The dataset allows participants to enter the country where they work
(Country
) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S., but examining the Country
field, we see there are many ways of writing this (e.g. “United States”,
“USA”, “U.S.”). List the top 10 ways participants identify the U.S.
GROUP BY
clauseORDER BY
clauseIbis is “a framework to access [and analyze data] from different
sources in a standard way” using python methods. It uses DuckDB as its
default backend but can also use pandas as well as many other databases
and dataframe libraries. In theory, if you write your code and it works
with one of the backends, it should work with others as well. However,
this is still under active development so some operations are not yet
supported (see the backend
operations matrix). We will use the default DuckDB backend for this
assignment. You will probably want to set
ibis.options.interactive = True
for this assignment so that
results are immediately displayed.
Read in the data file. Find the job title and salary associated with the entry that has the most states selected.
argmax
is a method that works on the column whose
value you want to return, passing the expression you want to compare as
the argument.Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208. In addition to splitting the string as in part a, you will need to unnest the column.
unnest
operation.Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
filter
method to filter rows by some
predicate.cast
Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
to_timestamp
that converts strings to dates but requires a format stringmutate
a table in pandas similar to how
assign
works in pandasThe dataset allows participants to enter the country where they work
(Country
) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S., but examining the Country
field, we see there are many ways of writing this (e.g. “United States”,
“USA”, “U.S.”). List the top 10 ways participants identify the U.S.
order_by
method but to obtain a
descending order, it’s a bit
different