Assignment 3

Goals

The goal of this assignment is to use pandas to perform some data cleaning tasks.

Instructions

You should work on the assignment on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. When working locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter lab.

In this assignment, we will clean the Ask a Manager Salary Survey which is an open dataset anyone can add to by filling out the associated form. I have downloaded the data as a comma-separated values (csv) file and retitled the fields, available here. Note: All frameworks can read the .csv.gz format directly (no need to decompress). The attributes of the data are:

  1. Timestamp: the date the survey was completed
  2. Age: age of participant
  3. Industry: field of work
  4. JobTitle
  5. JobDetails: extra information about the job
  6. Salary: annual salary
  7. ExtraComp: extra compensation
  8. Currency: the currency of the salary (e.g. USD)
  9. CurrencyOther: a currency not listed for the previous field
  10. IncomeDetails: extra information about income
  11. Country: the country the participant works in
  12. State: if in the U.S., the state(s) the participant works in
  13. City: the city the participant works in
  14. ExpOverall: overall years working
  15. ExpInField: years working in current field
  16. Education: amount of education
  17. Gender
  18. Race
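Since the data ships as a gzip-compressed CSV, it can be loaded in one step. The sketch below demonstrates this with a tiny in-memory sample standing in for the real file (the actual filename you downloaded will differ); pandas handles gzip via the `compression` argument or by inferring it from a `.gz` extension.

```python
import gzip
import io

import pandas as pd

# Tiny gzip-compressed CSV standing in for the real survey download.
raw = b'Timestamp,Age,Salary\n2021-04-27,25-34,"75,000"\n'
buf = io.BytesIO(gzip.compress(raw))

# pandas reads the compressed bytes directly; with a real file path
# ending in .gz, compression is inferred automatically.
df = pd.read_csv(buf, compression="gzip")
```

With the real file, `pd.read_csv("your_download.csv.gz")` alone suffices.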

Due Date

The assignment is due at 11:59pm on Friday, March 10.

Submission

You should submit the single completed notebook file named a3.ipynb on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.

Details

In the assignment, we will be transforming and cleaning the dataset from its raw form to a more usable form. The set of changes includes:

  • Reformatting the Salary column
  • Updating the Range columns
  • Combining and cleaning the Currency Columns
  • Cleaning the Salary Column
  • Transforming and calculating salary statistics for different industries by state

In Part 2, there are two different versions depending on whether you are in CSCI 490 or CSCI 503.

1. Reformatting the Salary column (10 pts)

We wish to clean and convert the Salary column to an integer type. This task was required as a subtask in the previous assignments, but now we wish to persist the changes. Note that the Salary column is reported as a string that may contain commas. At the end of this task, you should have a data frame that has the original columns with Salary replaced with the version as an integer (a float should also be ok).
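A minimal sketch of this conversion, assuming the only obstruction is embedded commas (shown here on a toy frame rather than the survey data):

```python
import pandas as pd

# Salary arrives as strings that may contain commas, e.g. "75,000".
df = pd.DataFrame({"Salary": ["75,000", "120000", "54,600"]})

# Strip the commas, then convert to an integer type in place.
df = df.assign(Salary=df.Salary.str.replace(",", "", regex=False).astype(int))
```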

2. Updating the Range Columns

Because the Age, ExpOverall, and ExpInField columns give ranges as strings, they are more difficult to query. For example, if we wanted the average salary of all those who had been working for 20 or fewer years, we would have to select all of the age ranges that meet that criterion. Instead, we would like to translate these ranges. First, convert any values with “under”/“less than” phrasing to use zero as the lower bound (0-17). Also, convert any values with “over”/“more than” phrasing to use 999 as the upper bound (65-999). We can then use two numeric columns (2a) or a column of intervals (2b) to simplify these queries. Students in CSCI 490 may do either 2a or 2b. Students in CSCI 503 must do 2b.
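The under/over normalization might look like the following sketch. The exact phrasings (“under 18”, “65 or over”) are assumptions; check the actual survey strings before applying it.

```python
import pandas as pd

# Assumed phrasings; the real survey strings may differ slightly.
ages = pd.Series(["under 18", "18-24", "65 or over"])

# "under N" -> "0-(N-1)", using a callable replacement for the arithmetic.
ages = ages.str.replace(
    r"under (\d+)", lambda m: f"0-{int(m.group(1)) - 1}", regex=True
)

# "N or over" -> "N-999".
ages = ages.str.replace(r"(\d+) or over", r"\1-999", regex=True)
```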

2a. [CSCI 490] Splitting the Range columns (15 pts)

If we split these columns into numeric values, we can use standard numeric comparisons instead. Convert each column into two separate columns, one for the start of the range and one for the end of the range. Add 1 to the end of the range so we do not have any gaps if someone queries with a floating point value. For example, 18-24 becomes 18 and 25. At the end, you should have six columns instead of three (AgeRangeStart, AgeRangeEnd, ExpRangeStart, ExpRangeEnd, FieldExpRangeStart, FieldExpRangeEnd). You should be able to write a query like:

df[(df.AgeRangeStart >= 18) & (df.AgeRangeEnd <= 45)]
Hints:
  • Split the string and then convert the components to integers
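The hint above can be sketched as follows for one column (repeat for ExpOverall and ExpInField); this assumes the ranges have already been normalized to a "start-end" string form:

```python
import pandas as pd

df = pd.DataFrame({"Age": ["0-17", "18-24", "65-999"]})

# Split "start-end" into two columns and convert both to integers.
parts = df.Age.str.split("-", expand=True).astype(int)
df["AgeRangeStart"] = parts[0]
df["AgeRangeEnd"] = parts[1] + 1  # add 1 so adjacent ranges tile without gaps
df = df.drop(columns="Age")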

2b. [CSCI 503] Creating Range Interval Columns (15 pts)

As in Part 2a, we can split the columns into numeric values, but instead of creating two columns, we will create pd.Interval objects. Add 1 to the end of the range so we do not have any gaps if someone queries with a floating point value ([18,25) instead of [18,24)). A column of intervals supports queries over intervals instead of requiring multiple boolean queries. Create an interval for each range (e.g. [18,25)). The range should be closed on the left (not the default). These columns may replace (use assign) the original Age, ExpOverall, and ExpInField columns. You should be able to write a query like:

df.Age.array.overlaps(pd.Interval(18,45, closed='left'))
Hints:
  • The .array property allows direct access to the backing IntervalArray for operations specific to intervals.
  • You can use pd.arrays.IntervalArray.from_arrays to construct an interval array from separate left and right arrays.
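Putting the two hints together, a sketch for one column (again assuming normalized "start-end" strings) might look like:

```python
import pandas as pd

df = pd.DataFrame({"Age": ["18-24", "25-34"]})

# Split into numeric endpoints, then build a left-closed IntervalArray.
parts = df.Age.str.split("-", expand=True).astype(int)
intervals = pd.arrays.IntervalArray.from_arrays(
    parts[0], parts[1] + 1, closed="left"  # [18, 25) rather than [18, 24)
)

# Replace the string column with the interval column.
df = df.assign(Age=intervals)

# Interval-aware query via the backing array.
mask = df.Age.array.overlaps(pd.Interval(18, 45, closed="left"))
```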

3. Combine and Clean the Currency Columns (20 pts)

The survey included a selectable list for some currencies as the Currency column, but that list also allowed an “Other” option. If the participant selected "Other", they were supposed to fill in the CurrencyOther field. Rather than having two columns for this information, we would like to have one column that combines this information. In many cases, the participants entered the three character currency code, but some entered the name of the currency or some other value. Using the currency codes from the SIX Group (Excel, XML), check for the codes that were entered in the CurrencyOther column. Strip any whitespace and search for any codes even if they are listed with additional text, extracting only the first one. A code should be in all capital letters (A-Z) and exactly three characters long, and should be separated from the other characters by a word boundary (\b). Then, combine the two currency columns (Currency and the extracted codes from CurrencyOther when Currency is "Other").

Hints:
  • It is likely easier to use the read_excel method in pandas, but this will probably require installing the xlrd module.
  • pandas extract method allows the use of a regular expression to extract a specified pattern.
  • You can specify the number of characters to be repeated in a regular expression using {#} syntax.
  • You can use the where method to combine two series based on a condition.
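The extract-and-combine steps can be sketched on toy data as below. The code list here is a hypothetical stand-in; in your notebook it should come from the SIX Group spreadsheet via read_excel.

```python
import pandas as pd

# Hypothetical code list; the real list comes from the SIX Group file.
codes = {"USD", "EUR", "AUD"}

df = pd.DataFrame({
    "Currency": ["USD", "Other", "Other"],
    "CurrencyOther": [None, " AUD ", "Australian Dollars (AUD)"],
})

# Extract the first three-uppercase-letter token bounded by \b.
extracted = df.CurrencyOther.str.strip().str.extract(r"\b([A-Z]{3})\b")[0]

# Keep only tokens that are actual currency codes.
extracted = extracted.where(extracted.isin(codes))

# Keep Currency unless it is "Other"; then use the extracted code.
df["Currency"] = df.Currency.where(df.Currency != "Other", extracted)
```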

4. Salary Cleaning (15 pts)

Some salaries seem rather low. For example, if you run a query for all of the salaries that are between 2 and 500 USD, restricting to US currency, there are a number that don’t seem to make much sense. Those at $0 or $1 seem more reasonable. What might be happening with these salaries? State your reasoning about what may be happening, and devise a method to clean these salaries. Run this method so that the values are more in line with what we would expect to see, and update the Salary column accordingly.

Hints:
  • What other ways may people report salary information? Think about the frequency they are paid or whether they might list their salaries with different units.
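One plausible interpretation, offered only as an assumption and not as the required answer: values like 65 or 120 may be annual salaries reported in thousands. A sketch of that single hypothesis (the thresholds are illustrative; your own reasoning may lead elsewhere):

```python
import pandas as pd

# Toy data: 65 and 120 are suspiciously low for annual USD salaries.
df = pd.DataFrame({"Salary": [65.0, 120.0, 85000.0, 0.0]})

# Assumption: low nonzero values were reported in thousands, so rescale.
low = (df.Salary >= 2) & (df.Salary <= 500)
df["Salary"] = df.Salary.mask(low, df.Salary * 1000)
```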

5. Salary by State (20 pts)

Using the updated data frame (make sure the salary information was fixed in Parts 1 and 4), extract those jobs in the top five industries in the survey, and compute their average salary by state. You should produce a data frame where the columns are the five industries, and the rows are the states. Recall, however, that some respondents listed multiple states. Thus, you first need to unnest these states so that a respondent’s salary is used in the calculation for every state that was listed. This will require data transformation! An example output is shown below (although this does not have any Part 4 updates):

Industry       Computing or Tech  Education (Higher Education)  Government and Public Administration   Health care    Nonprofits
States
Alabama            100240.000000                  59901.450000                         101750.000000  82167.200000  58632.666667
Alaska             112500.000000                  71767.875000                          67080.000000  91013.333333  59720.000000
Arizona            104242.105263                  62746.102564                          67860.357143  72963.242424  50446.666667
...
West Virginia      120500.000000                  48611.111111                          58178.000000  80005.333333  48912.500000
Wisconsin          114268.000000                  57807.112903                          61441.148148  72431.106383  54346.882353
Wyoming                      NaN                  54606.000000                                   NaN  87338.666667  95000.000000
Hints:
  • Depending on how you answered Part 4, your results should look different than the sample shown.
  • Are either of melt or pivot useful here?
  • When you transform the data, you may see an error about a unique index. One way around this is to use a numeric index obtained via reset_index (may require two calls).
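One possible shape for the transformation, sketched on toy data. It assumes multi-state entries are comma-separated (verify against the actual data), and uses explode plus pivot_table; pivot_table aggregates duplicate state/industry pairs itself, which is one way around the unique-index error mentioned in the hints.

```python
import pandas as pd

# Toy stand-in for the cleaned survey frame.
df = pd.DataFrame({
    "Industry": ["Tech", "Tech", "Health care"],
    "State": ["Illinois", "Illinois, Wisconsin", "Wisconsin"],
    "Salary": [100000, 90000, 80000],
})

# Restrict to the five most common industries.
top = df.Industry.value_counts().head(5).index
df = df[df.Industry.isin(top)].copy()

# Unnest: split multi-state strings, then give each state its own row.
df["State"] = df.State.str.split(", ")
df = df.explode("State").reset_index(drop=True)

# Mean salary per state (rows) and industry (columns).
result = df.pivot_table(index="State", columns="Industry",
                        values="Salary", aggfunc="mean")
```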