CSCI 640/490 – Assignment 3

Goals

The goal of this assignment is to use polars to perform some data cleaning tasks. Updated 2025-10-14: Part 2 added; deadline extended.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.12 or higher for your work. To install Python, consider using uv, miniforge, or Anaconda. Also, install JupyterLab (or Jupyter Notebook). For this assignment, you will need to make sure the polars package is installed. From conda, this can be accomplished with conda install -c conda-forge polars.

In this assignment, we will clean the Ask a Manager Salary Survey which is an open dataset anyone can add to by filling out the associated form. I have downloaded the data as a comma-separated values (csv) file and retitled the fields, available here. Note: All frameworks can read the .csv.gz format directly (no need to decompress). The attributes of the data are:

Timestamp: the date the survey was completed
Age: age of participant
Industry: field of work
JobTitle
JobDetails: extra information about the job
Salary: annual salary
ExtraComp: extra compensation
Currency: the currency of the salary (e.g. USD)
CurrencyOther: a currency not listed for the previous field
IncomeDetails: extra information about income
Country: the country the participant works in
State: if in the U.S., the state(s) the participant works in
City: the city the participant works in
ExpOverall: overall years working
ExpInField: years working in current field
Education: amount of education
Gender
Race

Due Date

The assignment is due at 11:59pm on ~~Friday, October 17~~ Tuesday, October 21.

Submission

You should submit the single completed notebook file named a3.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

In the assignment, we will be transforming and cleaning the dataset from its raw form a more usable form. The set of changes include:

Reformatting the Salary column
Updating the Range columns
Combining and cleaning the Currency Columns
Cleaning the Salary Column
Transforming and calculating salary statistics for different industries by state

1. Polars

a. Reformatting the Salary column (10 pts)

We wish to clean and convert the Salary column to an integer type. This task was required as a subtask in the previous assignments, but now we wish to persist the changes. Note that the Salary column is reported as a string that may contain commas. At the end of this task, you should have a data frame that has the original columns with Salary replaced with the version as an integer (a float should also be ok). Calculate the mean salary to make sure this works. (Note that this value is flawed due to currency differences).

b. Updating the Range Columns (20 pts)

Because the Age, ExpOverall, and ExpInField columns give ranges as strings, they are more difficult to query. For example, if we wanted the average salary of all those who had been working for 20 or less years, we would have to select all of the age ranges that meet that criteria. Instead, we would like to translate these ranges. First, convert any columns with “under”, “less than” phrasing to use zero as the lower bound (e.g., 0-17). Also, convert any columns with “over”, “more than” phrasing to use 999 as the upper bound (65-999). Then, for each of the three columns, convert the string to a list [<start>, <end>] where <start> is the beginning of the range and <end> is the end of the range and both are integers. For example, "18-24" becomes [18, 24]. You may assume that all ranges are inclusive. At the end of this task, you should have a data frame that has the original columns with Age, ExpOverall, and ExpInField replaced with lists of two integers. You should be able to write a query like:

dfrr.filter(pl.col('Age').list[0] <= 25)

Hints:

Split the string and then convert the components to integers
You can convert all of the columns in one call. Look at how pl.col works.

c. Combine and Clean the Currency Columns (20 pts)

The survey included a selectable list for some currencies as the Currency column, but that list also allowed an “Other” option. If the participant selected "Other", they were supposed to fill in the CurrencyOther field. Rather than having two columns for this information, we would like to have one column that combines this information. In many cases, the participants entered the three character currency code, but some entered the name of the currency or some other value. Using the currency codes from the SIX Group (Excel, XML), check for the codes that were entered in the CurrencyOther column. Strip any whitespace and search for any codes even if they are listed with additional text, extracting only the first one. A code should be in all capital letters (A-Z) and exactly three characters long, and should be separated from the other characters by a word boundary (\b). Then, combine the two currency columns (Currency and the extracted codes from CurrencyOther when Currency is "Other"). If the currency is “Other” and there is no valid code in CurrencyOther, then set the currency to null. At the end of this task, you should have a data frame that has the original columns with Currency replaced with the combined version and CurrencyOther removed.

Hints:

It is likely easier to use the read_excel method in polars, but this will probably require installing the fastexcel module.
polars extract method allows the use of a regular expression to extract a specified pattern.
You can specify the number of characters to be repeated in a regular expression using {#} syntax.
You can use the when or coalesce methods to combine two columns.

d. Salary Cleaning (15 pts)

Some salaries seem rather low. For example, if you run a query for all of the salaries that are between 2 and 500 USD, restricting to US currency, there are a number that don’t seem to make much sense. Those at $0 or $1 seem more reasonable. What might be happening with these salaries? State your reasoning on how what may be happening and devise a method to clean these salaries. Run this method so that they are more in line with the values we would expect to see, and update the Salary column accordingly.

Hints:

What other ways may people report salary information? Think about the frequency they are paid or whether they might list their salaries with different units.

e. Salary by State (20 pts)

Using the updated data frame (make sure the salary information was fixed in Parts a and d), extract those jobs in the top five industries in the survey, and compute their average salary by state. You should produce a data frame where the columns are the five industries, and the rows are the states. Recall, however, that some respondents listed multiple states. Thus, you first need to explode these states so that a respondent’s salary is used in the calculation for every state that was listed. This will require data transformation! An example output is shown below (although this does not have any Part d updates):

Industry	Computing or Tech	Education (Higher Education)	Government and Public Administration	Health care	Nonprofits
States
Alabama	100240.000000	59901.450000	101750.000000	82167.200000	58632.666667
Alaska	112500.000000	71767.875000	67080.000000	91013.333333	59720.000000
Arizona	104242.105263	62746.102564	67860.357143	72963.242424	50446.666667
…	…	…	…	…	…
West Virginia	120500.000000	48611.111111	58178.000000	80005.333333	48912.500000
Wisconsin	114268.000000	57807.112903	61441.148148	72431.106383	54346.882353
Wyoming	NaN	54606.000000	NaN	87338.666667	95000.000000

Hints:

Depending on how you answered Part d, your results should look different than the sample shown.
Are either of pivot or unpivot useful here?
You may need an aggregation function for the transformation. See examples.

2. Open Refine

Start by downloading OpenRefine. Documentation is available here. When your project is completed, select the Undo/Redo tab, click the “Extract…” button and click the “Export” button in dialog, saving the operation history as a3-part2.json. Turn in that file as your solution for Part 2. If you do the extra credit (trying to do the transform in Part 1e), please turn in a separate json file (a3-part2-ec.json).

a. Reformatting the Salary Column (10 pts)

Clean and convert the Salary column to an integer type. You will need to remove commas. Load the data and make sure to create a project once you verify the data is loading properly. You can do this with two commands in the “Edit Cells” submenu for the Salary column.

Hints:

Depending on your system, you may need to uncompress the .gz file first.

b. Updating the Range Columns (20 pts)

Use the Text Facet for tor the Age, ExpOverall, and ExpInField columns to see the possible values. Using this information, convert any columns with “under”, “less than” phrasing to use zero as the lower bound (e.g., 0-17). Also, convert any columns with “over”, “more than” phrasing to use 999 as the upper bound (65-999). You can do this directly using the “Edit” feature in the Facet panel.

Then, for each of the three columns, convert each to two new columns, *Start and *End using the Split into several columns... operation. So Age should become AgeStart and AgeEnd, each storing integers. You can split and then rename the columns individually.

c. [CSCI 640 Only] Combine and Clean the Currency Columns (20 pts)

The survey included a selectable list for some currencies as the Currency column, but that list also allowed an “Other” option. If the participant selected "Other", they were supposed to fill in the CurrencyOther field. Rather than having two columns for this information, we would like to have one column that combines this information. First, we need to check which other values have a valid currency. Rather than read in another list, here is a list of currencies to search for:

['KRW', 'ZAR', 'PLN', 'NOK', 'TRY', 'CZK', 'CHF', 'KWD', 'CNY', 'USD', 'SGD', 'NGN', 'EUR', 'TTD', 'GBP', 'MYR', 'AUD', 'LKR', 'MXN', 'SAR', 'INR', 'SEK', 'DKK', 'ILS', 'THB', 'PHP', 'BRL', 'IDR', 'COP', 'ARS', 'NZD', 'CAD']

You can use the filter function from OpenRefine’s General Refine Expression Language (GREL) to find the first matching currency in the CurrencyOther column. Create a Custom Transform that extracts the first matching currency in the CurrencyOther column. You can do this with GREL’s if function.

d. Salary Cleaning (15 pts)

Some salaries seem rather low. For example, if you run a query for all of the salaries that are between 2 and 500 USD, restricting to US currency, there are a number that don’t seem to make much sense. Those at $0 or $1 seem more reasonable. What might be happening with these salaries? State your reasoning on how what may be happening and devise a method to clean these salaries. Run this method so that they are more in line with the values we would expect to see, and update the Salary column accordingly.

e. Transform States (10 pts)

Split the states based on the comma delimiter; remember to check the structure here; are there any spaces? Then, use the split multi-valued cells so that each state is on its own row, and use “Fill Down” to fill in the rest of the columns.

Extra Credit

CSCI 490 students can do Part 2c for extra credit
Anyone can figure out how to get the same type of result as Part 1e in OpenRefine. Note that this may be quite difficult. Please submit a separate file for this one.