The goal of this assignment is to learn how to use polars and SQL (DuckDB) for data analysis tasks.
You may choose to work on this assignment on a hosted environment
(e.g. Google Colab) or
on your own local installation of Jupyter and Python. You should use
Python 3.12 or higher for your work. To install Python, consider using
uv, miniforge, or Anaconda. Also, install JupyterLab (or Jupyter
Notebook). For this assignment, you will need to make sure the polars,
python-duckdb, and pandas packages are installed. From conda, this can
be accomplished with
conda install -c conda-forge polars python-duckdb pandas
.
In this assignment, we will again analyze the Ask
a Manager Salary Survey which is an open dataset anyone can add to
by filling out the associated form. I have downloaded the data
as a comma-separated values (csv) file and retitled the fields,
available here.
Note: All frameworks can read the .csv.gz
format directly (no need to decompress). The attributes of the data
are:
Timestamp
: the date the survey was completedAge
: age of participantIndustry
: field of workJobTitle
JobDetails
: extra information about the jobSalary
: annual salaryExtraComp
: extra compensationCurrency
: the currency of the salary (e.g. USD)CurrencyOther
: a currency not listed for the previous
fieldIncomeDetails
: extra information about incomeCountry
: the country the participant works inState
: if in the U.S., the state(s) the participant
works inCity
: the city the participant works inExpOverall
: overall years workingExpInField
: years working in current fieldEducation
: amount of educationGender
Race
You will do the same analysis as in Assignment 1, but now using polars, DuckDB, and pandas. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.
The assignment is due at 11:59pm on Thursday, September 25.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a2.ipynb
.
CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework; just label each part accordingly. Do not use loops in your solutions; use native methods provided by each framework.
Read and parse the data file and store it in a dataframe. Find the job title and salary associated with the entry that has the most states selected.
read_csv
method that will also read .csv.gz
files.str.
prefix to
the method.Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state so the answer is more than 1208! Here, you will need to convert the list of strings into new rows before counting.
explode
method
or expression
methodCompute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
str.
prefix to the
method.cast
method.Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
to_date
but make sure to check the documentation for the correct date format
specifiers.The dataset allows participants to enter the country where they work
(Country
) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S. Examining the Country
field,
we see there are many ways of writing this (e.g. “United States”, “USA”,
“U.S.”). List the top 10 ways participants identify the U.S.
DuckDB has a couple of APIs for interacting with databases: the basic
API (docs
which (mostly) complies with the Python DB-API 2.0.
(Technically, DuckDB’s DB API (docs)
complies, but you may use .sql
instead of
.execute
for the assignment). Please use the Basic or DB
API for this assignment, not the Relational API (docs).
In addition, DuckDB supports transferring data to and from dataframes.
Again, for this assignment, all computation must be done via SQL, but
you may display results as dataframes. To connect to a
new in-memory database, use duckdb.connect()
. To execute a
SQL statement, use duckdb.sql
, and to retrieve any results,
use .pl()
(or
show()/df()/fetchall()
/fetchone()
) after the
sql call. You should be able to write the solutions to all parts using
one or two SQL statements.
Read in the data file to a new table. Find the job title and salary associated with the entry that has the most states selected.
SELECT *, <EXPR>
.arg_max
function that will return a specified column’s value for the maximum
over a different columnCount the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208. In addition to splitting the string as in part a, you will need to unnest the column.
Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
TRY_CAST
variant.Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
strptime
with the correct format string to convert the values.SELECT *, <EXPR>
.2022-01-01 00:00:00
)The dataset allows participants to enter the country where they work
(Country
) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S., but examining the Country
field, we see there are many ways of writing this (e.g. “United States”,
“USA”, “U.S.”). List the top 10 ways participants identify the U.S.
GROUP BY
clauseORDER BY
clauseRead and parse the data file and store it in a dataframe. Find the job title and salary associated with the entry that has the most states selected.
read_csv
method that will also read .csv.gz
files.str.
prefix to
the method.idxmax
method that returns not only the count but the index of the row with the
maximum value.Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state so the answer is more than 1208! Here, you will need to convert the list of strings into new rows before counting.
explode
methodCompute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
query
astype
Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
to_datetime
later.assign
can add or replace a column; it returns a new version of the
dataframe.The dataset allows participants to enter the country where they work
(Country
) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S. Examining the Country
field,
we see there are many ways of writing this (e.g. “United States”, “USA”,
“U.S.”). List the top 10 ways participants identify the U.S.