Assignment 2

Goals

The goal of this assignment is to learn how to use pandas, DuckDB, and Ibis for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab. For this assignment, you will need to make sure the pandas, python-duckdb, and ibis-framework packages are installed. From conda, this can be accomplished with conda install -c conda-forge pandas python-duckdb ibis-framework.

In this assignment, we will again analyze the Ask a Manager Salary Survey which is an open dataset anyone can add to by filling out the associated form. I have downloaded the data as a comma-separated values (csv) file and retitled the fields, available here. Note: All frameworks can read the .csv.gz format directly (no need to decompress). The attributes of the data are:

  1. Timestamp: the date the survey was completed
  2. Age: age of participant
  3. Industry: field of work
  4. JobTitle
  5. JobDetails: extra information about the job
  6. Salary: annual salary
  7. ExtraComp: extra compensation
  8. Currency: the currency of the salary (e.g. USD)
  9. CurrencyOther: a currency not listed for the previous field
  10. IncomeDetails: extra information about income
  11. Country: the country the participant works in
  12. State: if in the U.S., the state(s) the participant works in
  13. City: the city the participant works in
  14. ExpOverall: overall years working
  15. ExpInField: years working in current field
  16. Education: amount of education
  17. Gender
  18. Race

You will do the same analysis as in Assignment 1, but now using pandas, DuckDB, and Ibis. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.

Due Date

The assignment is due at 11:59pm on Friday, February 17.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a2.ipynb.

Details

CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework.

1. Pandas

a. Maximum Number of States Selected (10 pts)

Read and parse the data file and store it in a dataframe. Find the job title and salary associated with the entry that has the most states selected.

Hints:
  • Pandas has a read_csv method that will also read .csv.gz files.
  • Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
  • Not only can you split the strings based on a delimiter, but you can also find the length of a list using string methods.
  • Pandas has an idxmax method that returns not only the count but the index of the row with the maximum value.

b. Number of Participants Working in Illinois (10 pts)

Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state so the answer is more than 1208! Here, you will need to convert the list of strings into new rows before counting.

Hints:
  • Do not use loops
  • Consider the explode method

c. Highest US Salary (10 pts)

Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.

Hints:

d. Latest Entry in 2021 (10 pts)

Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.

Hints:
  • You can parse dates either when reading the csv (see options) or using to_datetime later.
  • assign can add or replace a column; it returns a new version of the dataframe.
  • Consider filtering followed by a computation of the maximum date

e. Top 10 Ways to Identify the U.S. (10 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they live in the U.S., the state they work in. If a participant selects one or more states, we will assume they work in the U.S. Examining the Country field, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints:
  • Pandas has a method to count value occurrences
  • Remember to sort any value counts

2. DuckDB

DuckDB has a couple of APIs for interacting with databases: the basic API (docs which complies with the Python DB-API 2.0 and a relational API (docs). Please use the DB-API for this assignment. In addition, DuckDB supports transferring data to and from pandas dataframes. Again, for this assignment, all computation must be done via SQL, but you may display results as pandas dataframes. To connect to a new in-memory database, use duckdb.connect(). To execute a SQL statement, use duckdb.execute, and to retrieve any results, use .df() (or fetchall()/fetchone()) after the execute call. You should be able to write the solutions to all parts using one or two SQL statements.

a. Maximum Number of States Selected (10 pts)

Read in the data file to a new table. Find the job title and salary associated with the entry that has the most states selected.

Hints:
  • DuckDB has built-in support for CSV Loading, and you can store the data in a new table.
  • DuckDB has text functions to split strings up and nested functions to check the length of an array.
  • You might consider using a WITH clause to create temporary tables you can use in a following sql statement. Remember that you can add a new column using SELECT *, <EXPR>.
  • DuckDB has an arg_max function that will return a specified column’s value for the maximum over a different column

b. Number of Participants Working in Illinois (10 pts)

Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208. In addition to splitting the string as in part a, you will need to unnest the column.

Hints:
  • DuckDB supports UNNEST and has a number of text functions
  • Since unnest changes cardinality (the number of tuples), you will likely need to use it in a sub-select.

c. Highest US Salary (10 pts)

Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.

Hints:
  • Again, refer to the text functions to get rid of problematic characters.
  • You can convert a string to an integer by casting it. DuckDB also supports the TRY_CAST variant.

d. Latest Entry in 2021 (10 pts)

Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.

Hints:
  • You cannot cast the given string to a timestamp, but you can use strptime with the correct format string to convert the values.
  • Consider using a WITH clause to create temporary tables you can use in a following sql statement. Remember that you can add a new column using SELECT *, <EXPR>.
  • Note that it’s not just the maximum, but the maximum before a given timestamp. (2022-01-01 00:00:00)

e. Top 10 Ways to Identify the U.S. (10 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they live in the U.S., the state they work in. If a participant selects one or more states, we will assume they work in the U.S., but examining the Country field, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints:
  • Make sure you have a where clause.
  • To count by country, consider using the GROUP BY clause
  • To sort, consider using the ORDER BY clause

3. [CS640 Only] Ibis

Ibis is “a framework to access [and analyze data] from different sources in a standard way” using python methods. It uses DuckDB as its default backend but can also use pandas as well as many other databases and dataframe libraries. In theory, if you write your code and it works with one of the backends, it should work with others as well. However, this is still under active development so some operations are not yet supported (see the backend operations matrix). We will use the default DuckDB backend for this assignment. You will probably want to set ibis.options.interactive = True for this assignment so that results are immediately displayed.

a. Maximum Number of States Selected (10 pts)

Read in the data file. Find the job title and salary associated with the entry that has the most states selected.

Hints:
  • Ibis has similar string expressions that can be used to split strings
  • Ibis also has array methods that work on arrays.
  • Here, argmax is a method that works on the column whose value you want to return, passing the expression you want to compare as the argument.

b. Number of Participants Working in Illinois (10 pts)

Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208. In addition to splitting the string as in part a, you will need to unnest the column.

  • Ibis has similar string expressions that can be used to split strings
  • Ibis also has an unnest operation.
  • Ibis has methods to filter and count values as well.

c. Highest US Salary (10 pts)

Compute the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.

Hints:
  • Ibis use the filter method to filter rows by some predicate.
  • Refer to string expressions to deal with the commas
  • You can cast columns to different types with Ibis using cast

d. Latest Entry in 2021 (10 pts)

Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.

Hints:
  • Ibis has a to_timestamp that converts strings to dates but requires a format string
  • You can mutate a table in pandas similar to how assign works in pandas
  • Make sure to filter the timestamps before looking for the maximum (or using argmax)

e. Top 10 Ways to Identify the U.S. (10 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they live in the U.S., the state they work in. If a participant selects one or more states, we will assume they work in the U.S., but examining the Country field, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints:
  • Don’t forget to filter.
  • Ibis can count values without grouping
  • Like SQL, Ibis has an order_by method but to obtain a descending order, it’s a bit different
  • Ibis also has the ability to limit the number of results to the number we wish