Assignment 2

Goals

The goal of this assignment is to learn how to use polars and SQL (DuckDB) for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.12 or higher for your work. To install Python, consider using uv, miniforge, or Anaconda. Also, install JupyterLab (or Jupyter Notebook). For this assignment, you will need to make sure the polars, python-duckdb, and pandas packages are installed. From conda, this can be accomplished with conda install -c conda-forge polars python-duckdb pandas.

In this assignment, we will again analyze the Ask a Manager Salary Survey which is an open dataset anyone can add to by filling out the associated form. I have downloaded the data as a comma-separated values (csv) file and retitled the fields, available here. Note: All frameworks can read the .csv.gz format directly (no need to decompress). The attributes of the data are:

  1. Timestamp: the date the survey was completed
  2. Age: age of participant
  3. Industry: field of work
  4. JobTitle
  5. JobDetails: extra information about the job
  6. Salary: annual salary
  7. ExtraComp: extra compensation
  8. Currency: the currency of the salary (e.g. USD)
  9. CurrencyOther: a currency not listed for the previous field
  10. IncomeDetails: extra information about income
  11. Country: the country the participant works in
  12. State: if in the U.S., the state(s) the participant works in
  13. City: the city the participant works in
  14. ExpOverall: overall years working
  15. ExpInField: years working in current field
  16. Education: amount of education
  17. Gender
  18. Race

You will do the same analysis as in Assignment 1, but now using polars, DuckDB, and pandas. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.

Due Date

The assignment is due at 11:59pm on Thursday, September 25.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a2.ipynb.

Details

CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework; just label each part accordingly. Do not use loops in your solutions; use native methods provided by each framework.

1. Polars

a. Maximum Number of States Selected (10 pts)

Read and parse the data file and store it in a dataframe. Find the job title and salary associated with the entry that has the most states selected.

Hints
  • Polars has a read_csv method that will also read .csv.gz files.
  • Polars has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
  • Not only can you split the strings based on a delimiter, but you can also find the length of a list using string methods.

b. Number of Participants Working in Illinois (10 pts)

Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state so the answer is more than 1208! Here, you will need to convert the list of strings into new rows before counting.

Hints

c. Highest US Salary (10 pts)

Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.

Hints
  • String methods will help to process a column of strings with a single command. Remember to add the str. prefix to the method.
  • You can cast between types using the cast method.

d. Latest Entry in 2021 (10 pts)

Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.

Hints
  • You can parse dates using to_date but make sure to check the documentation for the correct date format specifiers.
  • Consider filtering followed by a computation of the maximum date

e. Top 10 Ways to Identify the U.S. (10 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they live in the U.S., the state they work in. If a participant selects one or more states, we will assume they work in the U.S. Examining the Country field, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints
  • Pandas has a method to count value occurrences
  • Remember to sort any value counts

2. DuckDB

DuckDB has a couple of APIs for interacting with databases: the basic API (docs which (mostly) complies with the Python DB-API 2.0. (Technically, DuckDB’s DB API (docs) complies, but you may use .sql instead of .execute for the assignment). Please use the Basic or DB API for this assignment, not the Relational API (docs). In addition, DuckDB supports transferring data to and from dataframes. Again, for this assignment, all computation must be done via SQL, but you may display results as dataframes. To connect to a new in-memory database, use duckdb.connect(). To execute a SQL statement, use duckdb.sql, and to retrieve any results, use .pl() (or show()/df()/fetchall()/fetchone()) after the sql call. You should be able to write the solutions to all parts using one or two SQL statements.

a. Maximum Number of States Selected (10 pts)

Read in the data file to a new table. Find the job title and salary associated with the entry that has the most states selected.

Hints
  • DuckDB has built-in support for CSV Loading, and you can store the data in a new table.
  • DuckDB has text functions to split strings up and nested functions to check the length of an array.
  • You might consider using a WITH clause to create temporary tables you can use in a following sql statement. Remember that you can add a new column using SELECT *, <EXPR>.
  • DuckDB has an arg_max function that will return a specified column’s value for the maximum over a different column

b. Number of Participants Working in Illinois (10 pts)

Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208. In addition to splitting the string as in part a, you will need to unnest the column.

Hints
  • DuckDB supports UNNEST and has a number of text functions
  • Since unnest changes cardinality (the number of tuples), you will likely need to use it in a sub-select.

c. Highest US Salary (10 pts)

Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.

Hints
  • Again, refer to the text functions to get rid of problematic characters.
  • You can convert a string to an integer by casting it. DuckDB also supports the TRY_CAST variant.

d. Latest Entry in 2021 (10 pts)

Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.

Hints
  • You cannot cast the given string to a timestamp, but you can use strptime with the correct format string to convert the values.
  • Consider using a WITH clause to create temporary tables you can use in a following sql statement. Remember that you can add a new column using SELECT *, <EXPR>.
  • Note that it’s not just the maximum, but the maximum before a given timestamp. (2022-01-01 00:00:00)

e. Top 10 Ways to Identify the U.S. (10 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they live in the U.S., the state they work in. If a participant selects one or more states, we will assume they work in the U.S., but examining the Country field, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints
  • Make sure you have a where clause.
  • To count by country, consider using the GROUP BY clause
  • To sort, consider using the ORDER BY clause

3. Pandas [CSCI 640]

a. Maximum Number of States Selected (10 pts)

Read and parse the data file and store it in a dataframe. Find the job title and salary associated with the entry that has the most states selected.

Hints
  • Pandas has a read_csv method that will also read .csv.gz files.
  • Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
  • Not only can you split the strings based on a delimiter, but you can also find the length of a list using string methods.
  • Pandas has an idxmax method that returns not only the count but the index of the row with the maximum value.

b. Number of Participants Working in Illinois (10 pts)

Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state so the answer is more than 1208! Here, you will need to convert the list of strings into new rows before counting.

Hints
  • Do not use loops
  • Consider the explode method

c. Highest US Salary (10 pts)

Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.

Hints

d. Latest Entry in 2021 (10 pts)

Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.

Hints
  • You can parse dates either when reading the csv (see options) or using to_datetime later.
  • assign can add or replace a column; it returns a new version of the dataframe.
  • Consider filtering followed by a computation of the maximum date

e. Top 10 Ways to Identify the U.S. (10 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they live in the U.S., the state they work in. If a participant selects one or more states, we will assume they work in the U.S. Examining the Country field, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints
  • Pandas has a method to count value occurrences
  • Remember to sort any value counts