The goal of this assignment is to get acquainted with Python using Jupyter Notebooks.
You may choose to work on this assignment in a hosted environment
(e.g. Google Colab) or on your own local installation of Jupyter and
Python. You should use
Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to
install and manage Python, but other options like mambaforge also
work. If you work locally, you may launch Jupyter Lab either from the
Navigator application or from the command line by running jupyter lab.
In this assignment, we will analyze the Ask
a Manager Salary Survey, an open dataset that anyone can add to
by filling out the associated form. The survey
provides some information about the options available and types of
entries. Some entries are controlled by radio buttons or checkboxes
while others allow free text responses. I have downloaded the data
as a comma-separated values (csv) file and retitled the fields,
available here. You will do
some analysis of this data to answer some questions about it. I have
provided code to load the data, and it should be possible to answer
questions from the loaded data. In your solution, you may choose to
organize the cells as you wish, but you must properly label each
problem’s code and its solution. Use a Markdown cell to add a header
denoting your work for a particular problem. Make sure to document what
your answer is to the question, and make sure your code actually
computes that. As the goal of this assignment is to become acquainted
with core Python, do not use any libraries other than the
gzip, csv, datetime, and collections modules.
You may start with the provided Jupyter Notebook, a1.ipynb. Download this notebook (right-click to save the link) and
upload it to your Jupyter workspace (either locally or on a hosted
environment). Make sure to execute the first cell in the notebook
(Shift+Enter). This cell will download the data and define the
rows variable. The rows variable is a list of
comma-delimited strings with the values of each field for each data
item. The field names are:
Timestamp: the date the survey was completed
Age: age of participant
Industry: field of work
JobTitle
JobDetails: extra information about the job
Salary: annual salary
ExtraComp: extra compensation
Currency: the currency of the salary (e.g. USD)
CurrencyOther: a currency not listed for the previous field
IncomeDetails: extra information about income
Country: the country the participant works in
State: if in the U.S., the state(s) the participant works in
City: the city the participant works in
ExpOverall: overall years working
ExpInField: years working in current field
Education: amount of education
Gender
Race
Important: Accessing the data in this way requires parsing the CSV file according to rules governing double-quoted fields (fields that can have commas). It is not recommended to use the rudimentary method of splitting strings to access fields; use the csv library to parse this correctly!
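To see why splitting on commas is unreliable, here is a minimal sketch using a single made-up line (not taken from the actual data) in which one quoted field contains a comma. The sample values are assumptions for illustration only.

```python
import csv

# Made-up line for illustration; the fourth field is double-quoted
# because it contains a comma.
line = '4/27/2021 11:02:10,25-34,"Education (Higher Education)","Librarian, Research",55000'

# Naive approach: split on every comma, including the one inside the quotes.
naive = line.split(",")

# csv.reader accepts any iterable of strings and honors the quoting rules.
parsed = next(csv.reader([line]))

print(len(naive))   # 6 pieces: the quoted job title was cut in two
print(len(parsed))  # 5 fields, as intended
print(parsed[3])    # Librarian, Research
```

The same reader can be given the whole list of lines at once instead of a one-element list.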
The assignment is due at 11:59pm on Friday, February 3.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a1.ipynb.
Read and parse the data file and store it in appropriate data structures. Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208.
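As a rough sketch of the general idea, the following counts entries in a small made-up sample whose multi-select field mentions a given value. The sample rows, the presence of a header line, and the ", " separator between states are all assumptions for illustration; check how the real data stores multiple states before relying on any of them.

```python
import csv

# Made-up sample. The header line and the ", " separator inside the
# State field are assumptions, not facts about the real dataset.
sample = [
    'JobTitle,Salary,State',
    'Analyst,50000,Ohio',
    'Engineer,90000,"Ohio, Texas"',
    'Teacher,40000,',
]

# DictReader takes the field names from the first line and yields one
# dictionary per remaining line.
records = list(csv.DictReader(sample))

def states_of(record):
    """Return the list of states in one record (empty if none selected)."""
    return [s.strip() for s in record["State"].split(",") if s.strip()]

ohio_count = sum(1 for r in records if "Ohio" in states_of(r))
print(ohio_count)  # 2
```

If the lines you are parsing do not start with a header, DictReader also accepts an explicit fieldnames argument.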
Hint: use the csv library and its DictReader class.
Find the job title and salary associated with the entry that has the most states selected.
Write code that computes the highest reported salary restricted to those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings not numbers so you will need to convert them before comparing.
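The conversion matters because strings compare character by character rather than by numeric value. A small sketch on made-up records (the values are invented, and it assumes each salary string contains only digits):

```python
# Made-up records for illustration; salaries are strings, as in the data.
records = [
    {"Salary": "81000", "Currency": "USD"},
    {"Salary": "120000", "Currency": "CAD"},
    {"Salary": "9500", "Currency": "USD"},
]

usd = [r for r in records if r["Currency"] == "USD"]

# Comparing the raw strings is lexicographic: "9500" beats "81000"
# because "9" sorts after "8".
max_as_text = max(r["Salary"] for r in usd)

# Converting to int first gives the numeric maximum.
max_as_number = max(int(r["Salary"]) for r in usd)

print(max_as_text)    # 9500
print(max_as_number)  # 81000
```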
Hint: int("81") returns an integer value of 81.
Find the salary associated with the last entry in 2021. You will need to parse the timestamp variable to obtain the year and compare individual dates.
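As an illustration of comparing parsed dates, the sketch below uses made-up timestamps. The format string is an assumption about how the timestamps are written; inspect a few real values and adjust it to match.

```python
from datetime import datetime

# Assumed layout: month/day/year hour:minute:second. Verify against the data.
FMT = "%m/%d/%Y %H:%M:%S"

# Made-up timestamps for illustration.
stamps = ["4/27/2021 11:02:10", "12/31/2021 23:59:01", "1/2/2022 8:15:00"]

# strptime turns each string into a datetime object.
parsed = [datetime.strptime(s, FMT) for s in stamps]

# datetime objects expose the year and can be compared directly.
in_2021 = [d for d in parsed if d.year == 2021]
latest = max(in_2021)

print(latest)  # 2021-12-31 23:59:01
```

To recover the matching entry rather than just the date, keep each parsed timestamp alongside its record, for example by taking the maximum with a key function.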
Hint: use the datetime module to parse the dates
using the strptime method. See the strptime format mini-language in the Python documentation. Then, the timestamps can be
compared.
The dataset allows participants to enter the country where they work
(Country) and then, if they live in the U.S., the state
they work in. If a participant selects one or more states, we will
assume they work in the U.S., but examining the Country
field, we see there are many ways of writing this (e.g. “United States”,
“USA”, “U.S.”). List the top 10 ways participants identify the U.S.
Hint: use the Counter class from the
collections module.
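For reference, a minimal sketch of how Counter tallies values, using a made-up list of country strings rather than the real data:

```python
from collections import Counter

# Made-up values for illustration; different spellings count separately.
countries = ["USA", "United States", "USA", "U.S.", "usa", "United States", "USA"]

counts = Counter(countries)

# most_common(n) returns the n most frequent values with their counts,
# ordered from most to least frequent.
top_two = counts.most_common(2)
print(top_two)  # [('USA', 3), ('United States', 2)]
```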