Assignment 1

Goals

The goal of this assignment is to get acquainted with Python using Jupyter Notebooks.

Instructions

You may work on this assignment in a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python, but other options like mambaforge also work. If you work locally, you may launch Jupyter Lab either from the Navigator application or from the command line by running jupyter lab.

In this assignment, we will analyze the Ask a Manager Salary Survey, an open dataset that anyone can add to by filling out the associated form. The survey provides some information about the options available and the types of entries. Some entries are controlled by radio buttons or checkboxes, while others allow free-text responses. I have downloaded the data as a comma-separated values (CSV) file and retitled the fields, available here. You will analyze this data to answer some questions about it. I have provided code to load the data, and it should be possible to answer each question from the loaded data.

In your solution, you may organize the cells as you wish, but you must properly label each problem’s code and its solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document your answer to each question, and make sure your code actually computes it. As the goal of this assignment is to become acquainted with core Python, do not use any libraries other than the gzip, csv, datetime, and collections modules.

You may start with the provided Jupyter Notebook, a1.ipynb. Download this notebook (right-click to save the link) and upload it to your Jupyter workspace (either locally or in a hosted environment). Make sure to execute the first cell in the notebook (Shift+Enter). This cell will download the data and define the rows variable. The rows variable is a list of strings, one per data item, each holding that item's field values separated by commas. The field names are:

  1. Timestamp: the date the survey was completed
  2. Age: age of participant
  3. Industry: field of work
  4. JobTitle
  5. JobDetails: extra information about the job
  6. Salary: annual salary
  7. ExtraComp: extra compensation
  8. Currency: the currency of the salary (e.g. USD)
  9. CurrencyOther: a currency not listed for the previous field
  10. IncomeDetails: extra information about income
  11. Country: the country the participant works in
  12. State: if in the U.S., the state(s) the participant works in
  13. City: the city the participant works in
  14. ExpOverall: overall years working
  15. ExpInField: years working in current field
  16. Education: amount of education
  17. Gender
  18. Race

Important: Accessing the data in this way requires parsing the CSV file according to the rules governing double-quoted fields (fields that can themselves contain commas). Splitting each string on commas will break such fields apart, so do not use that rudimentary method to access fields; use the csv library to parse the data correctly!
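
To illustrate the warning above, here is a minimal sketch comparing naive splitting with csv.reader. The row is made up for illustration and is not taken from the survey data.

```python
import csv

# Made-up row: the second field contains a comma, so it is double-quoted.
sample_row = 'Engineer,"Chicago, IL",USD'

# Naive splitting cuts the quoted field into two pieces.
naive_fields = sample_row.split(",")

# csv.reader accepts any iterable of lines and honors the quoting rules.
parsed_fields = next(csv.reader([sample_row]))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split reports four fields for a three-field row, while csv.reader returns the quoted value as a single field with the quotes removed.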

Due Date

The assignment is due at 11:59pm on Friday, February 3.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a1.ipynb.

Details

1. Number of Participants Working in Illinois (10 pts)

Read and parse the data file and store it in appropriate data structures. Count the number of participants that selected Illinois as a state where they worked. Important: the survey allows participants to select more than one state! The answer is more than 1208.

Hints:
  • Consider using the csv library and its DictReader class.
  • Python’s string methods may be useful in parsing the state field. When there are multiple states, they are separated by commas.
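
The two hints above could be combined roughly as follows. This is only a sketch under stated assumptions: sample_rows is a made-up stand-in for the notebook's rows variable, the three-name field_names list is a shortened hypothetical version of the real field list, and split_states is a helper invented for this example. Whether the real rows list starts with a header line is not assumed here, so the field names are passed explicitly.

```python
import csv

# Hypothetical stand-in for the notebook's rows variable (three fields only).
sample_rows = [
    'Analyst,"Illinois, Indiana",USD',
    "Teacher,Ohio,USD",
    "Developer,,CAD",
]
# Shortened, hypothetical field list; the real data has 18 fields.
field_names = ["JobTitle", "State", "Currency"]


def split_states(state_field):
    """Return the individual state names in a comma-separated State value."""
    return [name.strip() for name in state_field.split(",") if name.strip()]


# DictReader accepts any iterable of lines and yields one dict per row.
records = list(csv.DictReader(sample_rows, fieldnames=field_names))
states_per_record = [split_states(record["State"]) for record in records]
print(states_per_record)
```

Stripping whitespace after the split guards against a space following each comma, and the final filter keeps an empty State value from producing a list containing one empty string.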

2. Maximum Number of States Selected (10 pts)

Find the job title and salary associated with the entry that has the most states selected.

Hints:
  • How can you determine how many different states are listed in an entry?
  • You can do comparisons based on one attribute but need to keep track of others when that attribute is maximal.
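
The second hint describes a common pattern: compare on one attribute while remembering the whole record. A minimal sketch on made-up (name, score, city) tuples, unrelated to the survey data:

```python
# Made-up records: (name, score, city).
entries = [("Ana", 3, "Lima"), ("Bo", 7, "Oslo"), ("Cy", 5, "Rome")]

# Explicit loop: remember the whole record whenever a larger score appears.
best_entry = None
for entry in entries:
    if best_entry is None or entry[1] > best_entry[1]:
        best_entry = entry

# The built-in max with a key function expresses the same idea.
best_via_max = max(entries, key=lambda entry: entry[1])

print(best_entry, best_via_max)
```

Both versions keep the first record encountered when several share the maximal value.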

3. Highest US Salary (10 pts)

Write code that computes the highest reported salary among those salaries provided in US Dollars (those whose currency is “USD”). Note that the salaries are reported as strings, not numbers, so you will need to convert them before comparing.

Hints:
  • Note that the salary field is not formatted consistently: some numbers are written with commas and others without. Consider removing the commas.
  • You can convert a string to an integer by passing it to int. For example, int("81") returns the integer value 81.
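
A short sketch of why the conversion matters, using made-up salary strings. The parse_salary helper is hypothetical, and it assumes every value contains only digits and commas; real entries may need extra handling.

```python
def parse_salary(text):
    """Convert a salary string such as '55,000' to an int (digits and commas only)."""
    return int(text.replace(",", ""))


# Made-up salary strings, some with commas and some without.
samples = ["55,000", "81", "1,250,000"]

# Comparing the raw strings is lexicographic, which gives a misleading maximum.
largest_as_text = max(samples)

# Converting first gives the numeric maximum.
largest_as_number = max(parse_salary(s) for s in samples)

print(largest_as_text, largest_as_number)
```

Here the string comparison picks "81" because the character "8" sorts after "5" and "1", while the numeric comparison picks 1250000.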

4. Latest Entry in 2021 (15 pts)

Find the salary associated with the last entry in 2021. You will need to parse the Timestamp field to obtain the year and compare individual dates.

Hints:
  • You can use the datetime module to parse the dates using the strptime method. See the mini-language here. Then, the timestamps can be compared.
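
As a sketch of the hint above, the following parses a few made-up timestamps and compares them. The format string is an assumption made for this example; check it against the actual Timestamp values before relying on it.

```python
from datetime import datetime

# Assumed layout (month/day/year hour:minute:second); verify against the data.
ASSUMED_FORMAT = "%m/%d/%Y %H:%M:%S"

# Made-up timestamps for illustration.
stamps = ["4/27/2021 11:02:10", "12/31/2021 23:15:00", "1/2/2022 08:00:00"]

parsed = [datetime.strptime(stamp, ASSUMED_FORMAT) for stamp in stamps]

# datetime objects expose the year and compare chronologically.
in_2021 = [moment for moment in parsed if moment.year == 2021]
latest_2021 = max(in_2021)

print(latest_2021)
```

Because datetime objects order chronologically, max and the comparison operators work on them directly once the strings are parsed.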

5. Top 10 Ways to Identify the U.S. (15 pts)

The dataset allows participants to enter the country where they work (Country) and then, if they work in the U.S., the state(s) they work in. If a participant selects one or more states, we will assume they work in the U.S. Examining the Country field, however, we see there are many ways of writing this (e.g. “United States”, “USA”, “U.S.”). List the top 10 ways participants identify the U.S.

Hints:
  • Consider using the Counter class from the collections module.
  • Make sure that a participant has selected a state before including its country entry in the list.
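
The two hints above might be combined as in this sketch, which tallies made-up (country, state) pairs and skips any pair with an empty state. The pairs are invented for illustration and do not reflect the survey's actual counts.

```python
from collections import Counter

# Made-up (country, state) pairs; an empty state means none was selected.
sample_pairs = [
    ("USA", "Ohio"),
    ("United States", "Texas"),
    ("USA", "Iowa"),
    ("Canada", ""),
    ("U.S.", "Utah"),
    ("USA", "Maine"),
    ("United States", "Idaho"),
    ("USA", ""),
]

# Count a country spelling only when a state accompanies it.
counts = Counter(country for country, state in sample_pairs if state)

# most_common(n) returns the n most frequent (value, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Note that the last "USA" pair and the "Canada" pair are excluded from the tally because their state is empty.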