The goal of this assignment is to use pandas to perform some data cleaning tasks.
You should work on the assignment on your own local installation of
Jupyter and Python. You should use Python 3.8 or higher for your work.
When working locally, Anaconda is the easiest
way to install and manage Python, and you may launch Jupyter Lab either
from the Navigator application or via the command line with `jupyter lab`.
In this assignment, we will clean the Ask a Manager Salary Survey, which is an open dataset anyone can add to by filling out the associated form. I have downloaded the data as a comma-separated values (CSV) file and retitled the fields, available here.
Note: All frameworks can read the `.csv.gz` format directly (no need to decompress). The attributes of the data are:
- `Timestamp`: the date the survey was completed
- `Age`: age of participant
- `Industry`: field of work
- `JobTitle`
- `JobDetails`: extra information about the job
- `Salary`: annual salary
- `ExtraComp`: extra compensation
- `Currency`: the currency of the salary (e.g. USD)
- `CurrencyOther`: a currency not listed for the previous field
- `IncomeDetails`: extra information about income
- `Country`: the country the participant works in
- `State`: if in the U.S., the state(s) the participant works in
- `City`: the city the participant works in
- `ExpOverall`: overall years working
- `ExpInField`: years working in current field
- `Education`: amount of education
- `Gender`
- `Race`
The assignment is due at 11:59pm on Friday, March 10.
You should submit the single completed notebook file named `a3.ipynb` on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
In the assignment, we will be transforming and cleaning the dataset from its raw form into a more usable form. The changes are described in the parts below.
In Part 2, there are two different versions depending on whether you are in CSCI 490 or CSCI 503.
We wish to clean and convert the `Salary` column to an integer type. This task was required as a subtask in the previous assignments, but now we wish to persist the changes. Note that the `Salary` column is reported as a string that may contain commas. At the end of this task, you should have a data frame that has the original columns with `Salary` replaced by the integer version (a float is also acceptable).
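A minimal sketch of this conversion, using a small made-up data frame in place of the survey data:

```python
import pandas as pd

# Sketch: strip commas from the Salary strings, then convert to integers.
# The sample values are made up for illustration.
df = pd.DataFrame({"Salary": ["55,000", "120000", "1,250,000"]})
df["Salary"] = df["Salary"].str.replace(",", "", regex=False).astype(int)
print(df["Salary"].tolist())  # [55000, 120000, 1250000]
```

Using `regex=False` avoids treating the comma as a regular expression and is slightly faster for a literal replacement.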
Because the `Age`, `ExpOverall`, and `ExpInField` columns give ranges as strings, they are more difficult to query. For example, if we wanted the average salary of all those who had been working for 20 or fewer years, we would have to select all of the age ranges that meet that criterion. Instead, we would like to translate these ranges. First, convert any values with "under" or "less than" phrasing to use zero as the lower bound (0-17). Also, convert any values with "over" or "more than" phrasing to use 999 as the upper bound (65-999). We can then use two numeric columns (2a) or a column of intervals (2b) to simplify these queries. Students in CSCI 490 may do
either 2a or 2b. Students in CSCI 503 must do 2b.
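The normalization of open-ended labels can be sketched with a simple replacement. The exact label strings here ("under 18", "65 or over") are assumptions about the survey's phrasing; check the actual values in the data first:

```python
import pandas as pd

# Sketch: rewrite open-ended range labels as closed "lo-hi" ranges
# before splitting. The label strings are assumed, not taken from
# the real survey data.
age = pd.Series(["under 18", "18-24", "65 or over"])
age = age.replace({"under 18": "0-17", "65 or over": "65-999"})
print(age.tolist())  # ['0-17', '18-24', '65-999']
```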
If we split these columns into numeric values, we can use standard numeric comparisons instead. Convert each column into two separate columns, one for the start of the range and one for the end of the range. Add 1 to the end of the range so we do not have any gaps if someone queries with a floating-point value. For example, 18-24 becomes 18 and 25. At the end, you should have six columns instead of three (`AgeRangeStart`, `AgeRangeEnd`, `ExpRangeStart`, `ExpRangeEnd`, `FieldExpRangeStart`, `FieldExpRangeEnd`). You should be able to write a query like:
df[(df.AgeRangeStart >= 18) & (df.AgeRangeEnd <= 45)]
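One possible sketch of the split, shown on the `Age` column with made-up rows:

```python
import pandas as pd

# Sketch: split "lo-hi" range strings into numeric start/end columns,
# adding 1 to the end so consecutive ranges leave no gaps.
df = pd.DataFrame({"Age": ["0-17", "18-24", "25-34"]})
parts = df["Age"].str.split("-", expand=True).astype(int)
df["AgeRangeStart"] = parts[0]
df["AgeRangeEnd"] = parts[1] + 1
print(df[["AgeRangeStart", "AgeRangeEnd"]].values.tolist())
# [[0, 18], [18, 25], [25, 35]]
```

The same split would be repeated for `ExpOverall` and `ExpInField` to produce the other four columns.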
As in Part 2a, we can split the columns into numeric values, but instead of creating two columns, we will create `pd.Interval` objects. Add 1 to the end of the range so we do not have any gaps if someone queries with a floating-point value (`[18,25)` instead of `[18,24)`). A column of intervals supports queries over intervals instead of requiring multiple boolean queries. Create an interval for each range (e.g. `[18,25)`). The range should be closed on the left (not the default). These columns may replace (use `assign`) the original `Age`, `ExpOverall`, and `ExpInField` columns. You should be able to write a query like:
df.Age.array.overlaps(pd.Interval(18, 45, closed='left'))
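A sketch of constructing the interval column with the `from_arrays` helper mentioned in the hints (sample rows are made up):

```python
import pandas as pd

# Sketch: build left-closed pd.Interval values from range strings,
# adding 1 to the right endpoint to avoid gaps between ranges.
age = pd.Series(["18-24", "25-34"])
parts = age.str.split("-", expand=True).astype(int)
intervals = pd.arrays.IntervalArray.from_arrays(
    parts[0], parts[1] + 1, closed="left"
)
print(intervals)  # [18, 25) and [25, 35)
```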
Hints:
- The `.array` property allows direct access to the backing `IntervalArray` for operations specific to intervals.
- Use `pd.arrays.IntervalArray.from_arrays` to construct an interval array from separate left and right arrays.

The survey included a selectable list for some currencies as the
`Currency` column, but that list also allowed an "Other" option. If the participant selected "Other", they were supposed to fill in the `CurrencyOther` field. Rather than having two columns for this information, we would like to have one column that combines this information. In many cases, the participants entered the three-character currency code, but some entered the name of the currency or some other value. Using the currency codes from the SIX Group (Excel, XML), check for the codes that were entered in the `CurrencyOther` column. Strip any whitespace and search for any codes even if they are listed with additional text, extracting only the first one. A code should be in all capital letters (A-Z) and exactly three characters long, and should be separated from the other characters by a word boundary (`\b`). Then, combine the two currency columns (`Currency` and the extracted codes from `CurrencyOther` when `Currency` is "Other").
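A sketch of the extraction and combination steps on made-up rows. Note this skips validating the extracted codes against the SIX Group list, which the full solution should still do:

```python
import pandas as pd

# Sketch: pull the first three-letter code out of CurrencyOther and use
# it wherever Currency is "Other". The sample values are invented.
df = pd.DataFrame({
    "Currency": ["USD", "Other", "Other"],
    "CurrencyOther": [None, " AUD ", "Canadian dollars (CAD)"],
})
codes = df["CurrencyOther"].str.strip().str.extract(r"\b([A-Z]{3})\b")[0]
df["Currency"] = df["Currency"].where(df["Currency"] != "Other", codes)
print(df["Currency"].tolist())  # ['USD', 'AUD', 'CAD']
```

`where` keeps the original value where the condition holds and substitutes from `codes` elsewhere, which is exactly the "combine two series based on a condition" pattern from the hints.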
Hints:
- Use the `read_excel` method in pandas to read the code list; this will probably require installing the `xlrd` module.
- The `extract` method allows the use of a regular expression to extract a specified pattern.
- A regular expression can match an exact number of repetitions using the `{#}` syntax.
- Use the `where` method to combine two series based on a condition.

Some salaries seem rather low. For example, if you run a query for
all of the salaries that are between 2 and 500 USD, restricting to US currency, there are a number that don't seem to make much sense. Those at $0 or $1 seem more reasonable. What might be happening with these salaries? State your reasoning about what may be happening and devise a method to clean these salaries. Run this method so that the values are more in line with what we would expect to see, and update the `Salary` column accordingly.
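The diagnostic query from this part can be sketched as follows; the sample rows are made up, and devising the actual fix is left to you:

```python
import pandas as pd

# Sketch: find suspiciously low salaries reported in US currency.
# Sample rows are invented for illustration.
df = pd.DataFrame({
    "Salary": [85, 120000, 1, 300],
    "Currency": ["USD", "USD", "USD", "USD"],
})
low = df[df["Salary"].between(2, 500) & (df["Currency"] == "USD")]
print(low["Salary"].tolist())  # [85, 300]
```

`between` is inclusive on both ends by default, so $2 and $500 are both caught while $0 and $1 are excluded.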
Using the updated data frame (make sure the salary information was fixed in Parts 1 and 4), extract those jobs in the top five industries in the survey, and compute their average salary by state. You should produce a data frame where the columns are the five industries, and the rows are the states. Recall, however, that some respondents listed multiple states. Thus, you first need to unnest these states so that a respondent's salary is used in the calculation for every state that was listed. This will require data transformation! An example output is shown below (although this does not have any Part 4 updates):
| State | Computing or Tech | Education (Higher Education) | Government and Public Administration | Health care | Nonprofits |
|---|---|---|---|---|---|
| Alabama | 100240.000000 | 59901.450000 | 101750.000000 | 82167.200000 | 58632.666667 |
| Alaska | 112500.000000 | 71767.875000 | 67080.000000 | 91013.333333 | 59720.000000 |
| Arizona | 104242.105263 | 62746.102564 | 67860.357143 | 72963.242424 | 50446.666667 |
| … | … | … | … | … | … |
| West Virginia | 120500.000000 | 48611.111111 | 58178.000000 | 80005.333333 | 48912.500000 |
| Wisconsin | 114268.000000 | 57807.112903 | 61441.148148 | 72431.106383 | 54346.882353 |
| Wyoming | NaN | 54606.000000 | NaN | 87338.666667 | 95000.000000 |
Hint: use `reset_index` (this may require two calls).
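The unnesting and aggregation can be sketched on a tiny made-up frame (the real frame has many more columns, rows, and industries):

```python
import pandas as pd

# Sketch: split multi-state entries, explode so each state gets its own
# row, then average salaries per state/industry and pivot industries
# into columns. Sample data is invented.
df = pd.DataFrame({
    "Industry": ["Health care", "Health care", "Computing or Tech"],
    "State": ["Alabama, Alaska", "Alabama", "Alaska"],
    "Salary": [80000, 84000, 112500],
})
df = df.assign(State=df["State"].str.split(", ")).explode("State")
result = (
    df.groupby(["State", "Industry"])["Salary"].mean()
      .reset_index()
      .pivot(index="State", columns="Industry", values="Salary")
)
print(result.loc["Alabama", "Health care"])  # 82000.0
```

The assumed `", "` separator for multiple states is a guess; inspect the real `State` values before relying on it. Restricting to the top five industries (e.g. via `value_counts().head(5)`) is omitted here for brevity.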