Assignment 7

Goals

The goal of this assignment is to work with the file system, concurrency, and basic data processing in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.12 for your work. (Older versions may work, but your code will be checked with Python 3.12.) To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda or miniforge are probably the easiest ways to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application (Anaconda) or via the command line as jupyter-lab or jupyter lab. Also, if you work locally, you may need to install some packages for this assignment: requests (or aiohttp) and numpy, pandas, or polars. You can run conda install numpy polars pandas requests aiohttp to install them.

In this assignment, we will be working with files and threads. We will be using unemployment data from Illinois, available from the Illinois Department of Employment Security. I have downloaded the historical county data, and made it available on the course web site. You will download six zip files from there, extract the archives, load specific files from the archives, fix an issue with county names, and store the output data by county for some selected counties. You will use threading to process the data.

Due Date

The assignment is due at 11:59pm on Friday, November 22.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a7.ipynb.

Details

Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells. Note that CSCI 503 students must use asyncio, which is optional for CSCI 490 students.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Download & Extract Files

There are six zip files posted on the course web site at https://faculty.cs.niu.edu/~dakoop/cs503-2024fa/a7/ that you will download using Python. (You may also use the URL https://github.com/dakoop/cs503-2024fa-a7/raw/refs/heads/main/ if the faculty site is down.) Their filenames are unemp-<decade>.zip where <decade> is a decade from 1970 to 2020, inclusive. CSCI 503 students will use aiohttp for this task, while CSCI 490 students may use the requests library. (If you have trouble with this part and wish to continue with other parts of the assignment, download the files manually.)

1a. [CSCI 490] Download & Extract Files (20 pts)

To download the files, use the requests library. (This library is installed on tiger, but if you work locally, you may need to install it; conda install requests should work.) Note that each download is a two-step process: first “get”-ing the file and then writing the response to a file. Consult the documentation. After downloading the archives, you should extract all six zip files into the local directory. Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code to delete those files once to check that your code works, but be careful that you do not delete other files in the process!

Hints
  • Note that when unzipped, each archive will have a name that reflects the decade the data is from.
  • When writing the archive, you will probably want to open the file in binary write mode.
  • Use the exists method with os.path or pathlib.Path to check for path existence.
  • Either zipfile or shutil will be useful in extracting the archive.
  • shutil’s rmtree can be used to delete a directory.
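
To make the two-step download process concrete, here is a minimal sketch, assuming the archive extracts to a decade-named directory (see the first hint); the helper name download_and_extract is mine, and you should adapt the details to what you actually observe in the archives:

import zipfile
from pathlib import Path

import requests

BASE_URL = 'https://faculty.cs.niu.edu/~dakoop/cs503-2024fa/a7/'

def download_and_extract(decade):
    archive = Path(f'unemp-{decade}.zip')
    if not archive.exists():  # (1) skip the download if the archive is already here
        r = requests.get(BASE_URL + archive.name)
        r.raise_for_status()
        with open(archive, 'wb') as f:  # binary write mode for zip data
            f.write(r.content)
    out_dir = Path(str(decade))  # assumption: the archive unzips to a decade-named directory
    if not out_dir.exists():  # (2) skip extraction if it has already happened
        with zipfile.ZipFile(archive) as zf:
            zf.extractall()

for decade in range(1970, 2030, 10):  # 1970 through 2020, inclusive
    download_and_extract(decade)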

1b. [CSCI 503] Download & Extract Files (30 pts)

To download the files, use the aiohttp library and asyncio to download all the files. (This library is installed on tiger, but if you work locally, you may need to install it; conda install aiohttp should work. You may also need to install nest_asyncio via pip.) While you can refer to the example we showed in class, remember that you need to save each file locally after downloading it. This should be handled as a separate async coroutine, but note that the code in the method will be synchronous because operating systems generally do not support asynchronous file I/O. After downloading the archives, you should extract all six zip files into the local directory. This may be done synchronously! Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code to delete those files once to check that your code works, but be careful that you do not delete other files in the process!

Hints
  • You will probably need to run import nest_asyncio; nest_asyncio.apply() in the notebook.
  • Note that when unzipped, each archive will have a name that reflects the decade the data is from.
  • When writing the archive, you will probably want to open the file in binary write mode.
  • Use the exists method with os.path or pathlib.Path to check for path existence.
  • Either zipfile or shutil will be useful in extracting the archive.
  • shutil’s rmtree can be used to delete a directory.
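
Here is a minimal sketch of the asyncio version under the same directory-name assumption as above: each download is its own coroutine, while file writing and extraction stay synchronous. The names download_one and download_all are mine:

import asyncio
import zipfile
from pathlib import Path

import aiohttp
import nest_asyncio

nest_asyncio.apply()  # allow asyncio.run inside the notebook's running event loop

BASE_URL = 'https://faculty.cs.niu.edu/~dakoop/cs503-2024fa/a7/'

async def download_one(session, decade):
    archive = Path(f'unemp-{decade}.zip')
    if not archive.exists():  # skip re-downloading on reruns
        async with session.get(BASE_URL + archive.name) as resp:
            data = await resp.read()
        with open(archive, 'wb') as f:  # synchronous write; OSes lack async file I/O
            f.write(data)
    return archive

async def download_all():
    async with aiohttp.ClientSession() as session:
        tasks = [download_one(session, d) for d in range(1970, 2030, 10)]
        return await asyncio.gather(*tasks)

for archive in asyncio.run(download_all()):
    out_dir = Path(archive.stem.split('-')[-1])  # assumption: decade-named directory
    if not out_dir.exists():  # extraction can be synchronous
        with zipfile.ZipFile(archive) as zf:
            zf.extractall()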

2. Find Matching Files (10 pts)

The zip files are structured differently depending on the years. Some counties have their data stored in the numpy format (.npy) while others use the CSV format (.csv). In addition, some counties have updated data that we want to use (not the original, older data). In these cases, there is a directory named mod or update in the second level of the zip file (e.g. data/1980/mod) that contains the files we need. For a given year (e.g. 1985.csv), if there is a file in a subdirectory of a mod or update directory, use it. If there is not a file in that subdirectory, use the original file. Also ignore files with extensions that are not .npy or .csv. Create a list of all the paths (as pathlib.Path objects) that will need to be processed. Note that some files may reside in subdirectories (or sub-subdirectories). Do not move the files into different directories; keep them where they were originally extracted!

Hints
  • There is more than one way to accomplish this, depending on which libraries you use.
  • Remember that your code needs to check subdirectories.
  • Think about how you might prioritize the mod or update versions of files when reading through lists of files.
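
There is more than one workable design here; the sketch below keys each file by its path with any mod or update components removed, so an updated copy shadows its original. The function name find_files and this keying scheme are my own assumptions, not the only correct approach:

from pathlib import Path

def find_files(root_dirs):
    chosen = {}  # normalized location -> Path; updated copies overwrite originals
    for root in root_dirs:
        for p in Path(root).rglob('*'):
            if not p.is_file() or p.suffix not in ('.npy', '.csv'):
                continue  # ignore directories and other extensions
            updated = 'mod' in p.parts or 'update' in p.parts
            # key the file by its path with 'mod'/'update' dropped and the
            # extension removed, so 1985.npy and 1985.csv also pair up
            key = (tuple(d for d in p.parts[:-1] if d not in ('mod', 'update')),
                   p.stem)
            if updated or key not in chosen:
                chosen[key] = p
    return list(chosen.values())

paths = find_files(str(d) for d in range(1970, 2030, 10))  # assumption: decade-named dirs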

3. Structural Pattern Matching to Process a File (20 pts)

We will use numpy and pandas or polars for the data processing steps. (Again, numpy, polars, and pandas are installed on tiger, but if you work locally, you may need to install them; conda install numpy pandas polars should work.)

You should write a function with a match statement to handle four cases:

  1. an npy file from a mod or update subdirectory
  2. a csv file from a mod or update subdirectory
  3. an npy file from a non-updated subdirectory
  4. a csv file from a non-updated subdirectory

For each Path p, perform the match using p.parts. You may use guards, but you may not use if statements inside the match cases!

Load any npy file using numpy.load and then convert it to a dataframe; load any csv file using the appropriate method from pandas or polars. For any non-updated file, you will need to convert the RATE column by multiplying it by 100.
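
A possible skeleton for the function, assuming pandas; because a sequence pattern may contain at most one starred subpattern, the mod/update checks live in the guards. How the .npy files store their header row is an assumption here (adjust after inspecting one):

import numpy as np
import pandas as pd
from pathlib import Path

def process_file(p: Path) -> pd.DataFrame:
    match p.parts:
        case (*dirs, name) if ('mod' in dirs or 'update' in dirs) and name.endswith('.npy'):
            arr = np.load(p, allow_pickle=True)
            df = pd.DataFrame(arr[1:], columns=arr[0])  # assumption: row 0 is the header
        case (*dirs, name) if ('mod' in dirs or 'update' in dirs) and name.endswith('.csv'):
            df = pd.read_csv(p)
        case (*_, name) if name.endswith('.npy'):
            arr = np.load(p, allow_pickle=True)
            df = pd.DataFrame(arr[1:], columns=arr[0])
            df['RATE'] = df['RATE'].astype(float) * 100  # non-updated files need rescaling
        case (*_, name) if name.endswith('.csv'):
            df = pd.read_csv(p)
            df['RATE'] = df['RATE'] * 100
        case _:
            raise ValueError(f'unexpected path: {p}')
    # the column selection and filtering described below would continue here
    return df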

Besides reading the files, we will need to replace some of the county names to agree on a common scheme. The file has the columns COUNTY, FIPS, YEAR, LABOR_FORCE, EMPLOYED, UNEMPLOYED_NUMBER, and RATE. Note that the file also has a header with these column names. For each file, extract only the columns COUNTY, YEAR, LABOR_FORCE, EMPLOYED, and UNEMPLOYED_NUMBER. Then,

  1. Convert all values in the COUNTY column to upper-case.
  2. Filter the rows to include only DEKALB, KANE, BOONE, MCHENRY, WINNEBAGO, OGLE, LEE, and KENDALL counties.
  3. Add a new CALC_RATE column by dividing UNEMPLOYED_NUMBER by LABOR_FORCE and multiplying it by 100.

Hints
  • Both pandas and polars have methods to read and write csv files, although they have different names.
  • Both pandas and polars have str methods (upper and to_uppercase, respectively) that should be useful.
  • The isin/is_in method can be useful for checking if values match one of the values in a container like a list.
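
In pandas, these three steps might look like the following sketch (polars equivalents noted in comments); if the data came from an object-dtype npy array, you may first need to cast the numeric columns:

COUNTIES = ['DEKALB', 'KANE', 'BOONE', 'MCHENRY',
            'WINNEBAGO', 'OGLE', 'LEE', 'KENDALL']

def transform(df):
    # keep only the requested columns
    df = df[['COUNTY', 'YEAR', 'LABOR_FORCE', 'EMPLOYED', 'UNEMPLOYED_NUMBER']].copy()
    df['COUNTY'] = df['COUNTY'].str.upper()   # polars: .str.to_uppercase()
    df = df[df['COUNTY'].isin(COUNTIES)]      # polars: .is_in(COUNTIES)
    df['CALC_RATE'] = df['UNEMPLOYED_NUMBER'] / df['LABOR_FORCE'] * 100
    return df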

4. Use Threads to Process Files (30 pts)

Finally, we are going to process all the files using threads via concurrent.futures. Use a ThreadPoolExecutor to run the function from Part 3 for each matching file from Part 2. Take the results from each run and concatenate them.

From the single concatenated dataframe, create eight dataframes, one for each of the specified counties, and write eight files named for the county (e.g. DEKALB.csv.gz), each containing only the records from that county. The .gz extension means that the file will be compressed using the gzip algorithm to make the output smaller (pandas and polars will do this automatically if you specify that file extension).

Hints
  • Consider writing the processing function first, then add threading.
  • Use concurrent.futures to run each thread, and consider the map function to wait for all results to complete.
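
Putting it together, a sketch reusing names from the earlier sketches (process_file from Part 3, paths from Part 2, and COUNTIES from above), all of which are my own assumptions:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

with ThreadPoolExecutor() as executor:
    # map blocks until every file has been processed, preserving input order
    frames = list(executor.map(process_file, paths))

all_df = pd.concat(frames)  # polars: pl.concat(frames)

for county in COUNTIES:
    cdf = all_df[all_df['COUNTY'] == county]
    # the .csv.gz extension triggers gzip compression (see the note above)
    cdf.to_csv(f'{county}.csv.gz', index=False)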

Check Results

While this does not guarantee correct answers, the following code to compute the average unemployment rate over the past 40+ years should produce the following results (replace pd with pl if using polars):

counties_no_suffix = ['DEKALB', 'KANE', 'BOONE', 'MCHENRY', 'WINNEBAGO', 'OGLE', 'LEE', 'KENDALL'] 
for c in sorted(counties_no_suffix):
    cdf = pd.read_csv(f'{c}.csv.gz')
    print(c, cdf["CALC_RATE"].mean())

BOONE 8.054294578004583
DEKALB 5.721435538644467
KANE 6.40548752171531
KENDALL 5.457945584277057
LEE 6.260036908826006
MCHENRY 5.87830091101972
OGLE 6.60481888833855
WINNEBAGO 7.695749477329302