Goals

The goal of this assignment is to work with the file system, concurrency, and basic data processing in Python.

Instructions

You will be doing your work in Python for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. To use tiger, use the credentials you received. If you work remotely, make sure to download the notebook file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or from the command line as jupyter-lab. You may need to install some packages for this assignment: requests and pandas may be useful. Install them with the Navigator application or by running conda install pandas requests on the command line.

In this assignment, we will be working with files and threads. You will download a zip file, extract the archive, find particular files in the archive, and process each using threads.

Due Date

The assignment is due at 11:59pm on Friday, April 9.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a7.ipynb.

Details

Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Download & Extract Files (20 pts)

There is a zip file posted here, but you should download it using code in the notebook, not directly from this web page. The url is:

http://faculty.cs.niu.edu/~dakoop/cs503-2021sp/a7/archive.zip

To download this file, use the requests library. (This library is installed on tiger, but if you work locally, you may need to install it; conda install requests should work.) Note that this is a two-step process: first “get”-ing the file and then writing the response to a file. Consult the documentation. After downloading the archive, you should extract the entire zip file into the local directory. Remember that once the archive has been downloaded and extracted into the working directory, rerunning the code may not behave as expected because the files will already be there. To handle this situation, write code that checks (1) whether the archive has already been downloaded before downloading it, and (2) whether the extracted directory already exists before extracting it. You may also write code that deletes those files so you can check that your code works from a clean state, but be careful that you do not delete other files in the process!

Hints
  • When writing the archive, you will probably want to open the file in binary write mode.
  • Use the exists method with os.path or pathlib.Path to check for path existence.
  • Either zipfile or shutil will be useful in extracting the archive.
  • shutil’s rmtree can be used to delete a directory.
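The two existence checks described above can be sketched as follows. This is a minimal illustration, not a reference solution: the helper names download_if_needed and extract_if_needed are hypothetical, and the extracted directory name grocery-data is an assumption based on the paths mentioned in Part 2. The requests import sits inside the function only so that the skip path works without touching the network.

```python
import zipfile
from pathlib import Path

ARCHIVE_URL = "http://faculty.cs.niu.edu/~dakoop/cs503-2021sp/a7/archive.zip"


def download_if_needed(url, archive_path):
    """Download url to archive_path unless that file already exists.

    Returns True if a download happened, False if it was skipped.
    (Hypothetical helper name, used only for this sketch.)
    """
    archive_path = Path(archive_path)
    if archive_path.exists():
        return False
    import requests  # third-party library; only needed when we really download

    response = requests.get(url)  # step 1: "get" the file
    response.raise_for_status()  # fail loudly rather than save an error page
    with open(archive_path, "wb") as f:  # step 2: write in binary mode
        f.write(response.content)
    return True


def extract_if_needed(archive_path, extracted_dir, dest="."):
    """Extract the whole archive into dest unless extracted_dir already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    if Path(extracted_dir).exists():
        return False
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    return True


# Possible usage in a notebook cell (directory name is an assumption):
# download_if_needed(ARCHIVE_URL, "archive.zip")
# extract_if_needed("archive.zip", "grocery-data")
```

To test from a clean state, Path("archive.zip").unlink() and shutil.rmtree("grocery-data") remove exactly those two items and nothing else.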

2. Find Matching Files (10 pts)

Now, write code that searches the unzipped directory for all files that end with the file extension .csv and contain the pattern 2021~03, where the ~ could be any delimiter (_ or - for example). Note that there may be directories that have that pattern, and that some files may reside in subdirectories (or sub-subdirectories). The result should be a list of five file paths, each starting with 'grocery-data'.

Hints
  • There is more than one way to accomplish this, depending on which libraries you use.
  • Remember that your code needs to check subdirectories (and sub-subdirectories).
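One of the several possible approaches is a recursive glob combined with a regular expression. The sketch below makes two assumptions that the text does not settle: a "delimiter" is taken to be any single non-alphanumeric character, and the pattern is checked against the file name only, not against the names of parent directories. The function name find_matching_csvs is hypothetical.

```python
import re
from pathlib import Path

# Assumption: the delimiter is exactly one character that is not a letter
# or digit. \W excludes the underscore, so it is added explicitly.
PATTERN = re.compile(r"2021[\W_]03")


def find_matching_csvs(root):
    """Return sorted paths of .csv files under root whose name has the pattern."""
    root = Path(root)
    return sorted(
        str(p)
        for p in root.rglob("*.csv")  # rglob descends into all subdirectories
        # is_file() skips directories that happen to end in .csv
        if p.is_file() and PATTERN.search(p.name)
    )


# Possible usage (directory name taken from the paths mentioned above):
# matching = find_matching_csvs("grocery-data")
```

If files should also count when only a parent directory carries the pattern, searching str(p) instead of p.name would change that behavior. os.walk with the same regular expression is an equally reasonable route.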

3. Use Threads to Process Files (25 pts)

Finally, we are going to process all the files and extract the rows corresponding to 2021-03-31 using threads via concurrent.futures. Each comma-separated values (csv) file has the columns Date, Product, Price. Note that each file also has a header with these column names. For each file, extract those rows with the date 2021-03-31, and add a Filename column with the path to the file, and return them. Use a ThreadPoolExecutor to process all the matching files from Part 2. Take the results from each run, concatenate them together, and write them to a file named ‘2021-03-31-all.csv’. Consider using pandas or the csv library for the data processing steps. (Again, pandas is installed on tiger, but if you work locally, you may need to install it; conda install pandas should work.) Your final file should look something like this; the order of the rows may be different.

Hints
  • Consider writing the processing function first, then add threading.
  • pandas has methods to read and write csv files. The index argument can be useful if you don’t want to write the index.
  • pandas offers selection via boolean indexing similar to numpy.
  • Use concurrent.futures to run the processing function in threads, and consider the executor's map method to collect all of the results.
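The processing step might be organized as below: a function that handles one file, then a ThreadPoolExecutor that maps it over all paths. This sketch uses the standard csv library; pandas with boolean indexing would work just as well. The function names, the position of the Filename column, and the exact header spelling (Date, Product, Price with no extra spaces) are assumptions.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

TARGET_DATE = "2021-03-31"
# Assumed output column order; the expected file may order them differently.
FIELDNAMES = ["Date", "Product", "Price", "Filename"]


def extract_rows(path, target_date=TARGET_DATE):
    """Return the rows of one csv file whose Date equals target_date."""
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # the header line supplies the keys
            if row["Date"] == target_date:
                row["Filename"] = str(path)  # record where the row came from
                rows.append(row)
    return rows


def process_files(paths, out_path="2021-03-31-all.csv"):
    """Process every path in a thread pool and write the combined rows.

    Returns the number of data rows written.
    """
    with ThreadPoolExecutor() as executor:
        # map yields results in the order of paths; list() waits for all
        results = list(executor.map(extract_rows, paths))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        for rows in results:
            writer.writerows(rows)
    return sum(len(rows) for rows in results)
```

Writing extract_rows first and calling it on a single file makes it easy to confirm the filtering before any threading is added.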