The goal of this assignment is to work with the file system, concurrency, and basic data processing in Python.
You will be doing your work in Python for this assignment. You may choose to work on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.9 or higher for your work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command line as jupyter-lab. You may need to install some packages for this assignment: aiohttp (or requests) and pandas. Use the Navigator application or the command line (conda install pandas requests aiohttp) to install them.
In this assignment, we will be working with files and threads. We will be using unemployment data from Illinois, available from the Illinois Department of Employment Security. I have downloaded the historical county data and made it available on the course web site. You will download six zip files from there, extract the archives, load specific files from them, fix an issue with county names, and store the data by county for selected counties. You will use threading to download and process the data.
The assignment is due at 11:59pm on Friday, April 22.
You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a7.ipynb.
Please make sure to follow instructions to receive full credit. Because you will be writing code and adding to it, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells. Note that CSCI 503 students must use asyncio, which is optional for CSCI 490 students.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
There are six zip files posted on the course web site that you will download using Python. Their filenames are unemp-<decade>.zip, where <decade> is a decade from 1970 to 2020, inclusive. CSCI 503 students will use aiohttp for this task, while CSCI 490 students may use the requests library. (If you have trouble with this part and wish to continue with other parts of the assignment, download the files manually.)
To download the files (CSCI 490), use the requests library. (This library is installed on tiger, but if you work locally, you may need to install it; conda install requests should work.) Note that each download is a two-step process: first “get”-ing the file and then writing the response to a file. Consult the documentation.
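As a rough sketch of that two-step process (the BASE_URL below is a placeholder; use the actual address from the course web site):

```python
import requests

BASE_URL = 'https://example.com/a7'  # placeholder; use the course web site URL

for decade in range(1970, 2030, 10):            # 1970, 1980, ..., 2020
    fname = f'unemp-{decade}.zip'
    resp = requests.get(f'{BASE_URL}/{fname}')  # step 1: "get" the file
    resp.raise_for_status()                     # fail loudly on a bad status code
    with open(fname, 'wb') as f:                # step 2: write the response bytes
        f.write(resp.content)
```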
After downloading the archives, you should extract all six zip files into the local directory. Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code to delete those files once, to check that your code works, but be careful that you do not delete other files in the process!
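Check (1) amounts to skipping the requests.get call when the zip file already exists. A sketch of check (2) and the extraction, assuming each archive unpacks into a directory named after the zip file (verify this against the real archives):

```python
import zipfile
from pathlib import Path

for decade in range(1970, 2030, 10):
    archive = Path(f'unemp-{decade}.zip')
    extracted = Path(f'unemp-{decade}')  # assumed extracted directory name
    if not extracted.exists():           # (2) skip extraction on reruns
        with zipfile.ZipFile(archive) as zf:
            zf.extractall('.')           # extract into the local directory
```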
Hint: use the exists method with os.path or pathlib.Path to check for path existence.

To download the files (CSCI 503), use the aiohttp library and asyncio to download all the files. (This library is installed on tiger, but if you work locally, you may need to install it; conda install aiohttp should work. You may also need to install nest_asyncio via pip.) While you can refer to the example we showed in class, remember that you need to save each file locally after downloading it. This should be handled as a separate async coroutine, but note that the code in that coroutine will be synchronous because operating systems generally do not support asynchronous file I/O. After downloading the archives, you should extract all six zip files into the local directory. This may be done synchronously! Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code to delete those files once, to check that your code works, but be careful that you do not delete other files in the process!
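A sketch under the same assumptions as above (placeholder BASE_URL, assumed archive names), with the file-saving step split into its own coroutine as described:

```python
import asyncio
import aiohttp
import nest_asyncio
from pathlib import Path

nest_asyncio.apply()  # let asyncio.run work inside Jupyter's running event loop

BASE_URL = 'https://example.com/a7'  # placeholder; use the course web site URL

async def save(fname, data):
    # A separate coroutine for saving; its body is synchronous because
    # operating systems generally do not support asynchronous file I/O.
    with open(fname, 'wb') as f:
        f.write(data)

async def download(session, fname):
    if Path(fname).exists():  # skip archives that were already downloaded
        return
    async with session.get(f'{BASE_URL}/{fname}') as resp:
        data = await resp.read()
    await save(fname, data)

async def download_all():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download(session, f'unemp-{decade}.zip')
                               for decade in range(1970, 2030, 10)))

asyncio.run(download_all())
```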
Hints: run import nest_asyncio; nest_asyncio.apply() in the notebook, and use the exists method with os.path or pathlib.Path to check for path existence.

Now, write code to find all files in the unzipped directories that end with the file extension .csv. Note that each directory can have a different name for its files (e.g. employment or unemp) and there are two different file formats (csv and xlsx). You should find six csv files in the various subdirectories, and your code should find the paths to those files.
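One way to do this with pathlib's recursive globbing (assuming the unemp-<decade> directory names from the sketch above):

```python
from pathlib import Path

# Recursively collect the .csv files under the extracted directories;
# the xlsx files are ignored by the pattern.
csv_paths = sorted(Path('.').glob('unemp-*/**/*.csv'))
assert len(csv_paths) == 6, csv_paths  # expect one csv file per decade
```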
Finally, we are going to process all the files using threads via concurrent.futures. Use pandas for the data processing steps. (Again, pandas is installed on tiger, but if you work locally, you may need to install it; conda install pandas should work.) Besides reading the files, we will need to replace some of the county names to agree on a common scheme. Each comma-separated values (csv) file has the columns COUNTY, FIPS, YEAR, LABOR_FORCE, EMPLOYED, UNEMPLOYED_NUMBER, and RATE. Note that each file also has a header with these column names. For each file:

- Extract only the columns COUNTY, YEAR, LABOR_FORCE, EMPLOYED, and UNEMPLOYED_NUMBER.
- Convert the COUNTY column to upper-case.
- Keep only the rows for DEKALB, KANE, BOONE, MCHENRY, WINNEBAGO, OGLE, LEE, and KENDALL counties.
- Recompute the RATE column by dividing UNEMPLOYED_NUMBER by LABOR_FORCE.

Use a ThreadPoolExecutor to process (read, convert, filter, and recompute) each matching file from Part 2. Take the results from each run and concatenate them together.
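A sketch of one way to structure this, reusing the csv_paths list from Part 2 (the data may need county-name fixes beyond upper-casing; the sketch shows only the steps listed above):

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

COUNTIES = ['DEKALB', 'KANE', 'BOONE', 'MCHENRY',
            'WINNEBAGO', 'OGLE', 'LEE', 'KENDALL']
KEEP = ['COUNTY', 'YEAR', 'LABOR_FORCE', 'EMPLOYED', 'UNEMPLOYED_NUMBER']

def process(path):
    df = pd.read_csv(path, usecols=KEEP)         # read: only the needed columns
    df['COUNTY'] = df['COUNTY'].str.upper()      # convert: common naming scheme
    df = df[df['COUNTY'].isin(COUNTIES)].copy()  # filter: the eight counties
    df['RATE'] = df['UNEMPLOYED_NUMBER'] / df['LABOR_FORCE']  # recompute rate
    return df

with ThreadPoolExecutor() as executor:
    frames = list(executor.map(process, csv_paths))   # one thread task per file

all_counties = pd.concat(frames, ignore_index=True)  # concatenate the results
```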
From the single concatenated dataframe, create eight dataframes, one for each county, and write eight files with the county name (e.g. DEKALB.csv.gz) which contain only the records from the specified county. The .gz extension means that the file will be compressed using the gzip algorithm to make the output smaller (pandas will do this automatically if you specify that file extension).
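Continuing the sketch above, the per-county writes might look like:

```python
for county in COUNTIES:
    county_df = all_counties[all_counties['COUNTY'] == county]
    county_df.to_csv(f'{county}.csv.gz', index=False)  # gzip inferred from .csv.gz
```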
Hints: the index argument to to_csv can be useful if you don't want to write the index; pandas provides a string method upper (via the str accessor) that should be useful; and the isin method can be useful for checking if values match one of the values in a container like a list.

While this does not guarantee correct answers, the following code to compute the average unemployment rate over the past 40+ years should produce the corresponding results:
```python
counties_no_suffix = ['DEKALB', 'KANE', 'BOONE', 'MCHENRY', 'WINNEBAGO', 'OGLE', 'LEE', 'KENDALL']
for c in sorted(counties_no_suffix):
    cdf = pd.read_csv(f'{c}.csv.gz')
    print(c, cdf.RATE.mean())
```
```
BOONE 0.08099938784155404
DEKALB 0.05766694572727062
KANE 0.06465744592318969
KENDALL 0.05521362047852946
LEE 0.06341724493660843
MCHENRY 0.059635399403897664
OGLE 0.0666058153499312
WINNEBAGO 0.07757288231143362
```