The goal of this assignment is to work with the file system, concurrency, and basic data processing in Python. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.12 for your work, but other recent versions should also work. To use tiger, use the credentials you received. If you choose to work locally, Anaconda or miniforge are probably the easiest ways to install and manage Python. If you work locally, you may need to install some packages for this assignment: aiohttp (or requests) and pandas. Use the Navigator application or the command line (conda install pandas aiohttp requests) to install them.
In this assignment, we will be working with files and threads. We will be using energy usage data from New York state, available on the Utility Energy Registry. I have downloaded a subset of the data, and made it available on the course web site. You will download zip files from there, extract the archives, load files from the archives, improve the specification of missing data, and store it by year. You will use threading to download and process the data.
The assignment is due at 11:59pm on Friday, April 18.
You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a7.ipynb.
Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells. Note that CSCI 503 students must use asyncio, which is optional for CSCI 490 students.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
There are a bunch of zip files posted on the course web site that you will download using Python. Their filenames are ab.zip, c.zip, df.zip, gk.zip, ln.zip, o.zip, pr.zip, s.zip, tv.zip, and wz.zip, signifying the first letters of the counties contained in the files. CSCI 503 students will use aiohttp for this task while CSCI 490 students may use the requests library. (If you have trouble with this part and wish to continue with other parts of the assignment, download the files manually.)
CSCI 490 students: to download the files, use the requests library. (This library is installed on tiger, but if you work locally, you may need to install it; conda install requests should work.) Note that each download is a two-step process: first “get”-ing the file and then writing the response to a file. Consult the documentation. After downloading the archives, you should extract all the zip files into the local directory. Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code to delete those files in order to check that your code works, but be careful that you do not delete other files in the process! (A sketch appears after the hints below.)
Hints: the archives extract into the data directory; use the .exists method with os.path or pathlib.Path to check for path existence.
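As a rough sketch of this part (BASE_URL is a placeholder, not the real course address, and the sketch assumes the archives extract into a top-level data directory):

import zipfile
from pathlib import Path
import requests

BASE_URL = "https://example.edu/a7"  # hypothetical; use the course web site URL
NAMES = ["ab", "c", "df", "gk", "ln", "o", "pr", "s", "tv", "wz"]

def download_and_extract(name):
    archive = Path(f"{name}.zip")
    if not archive.exists():  # (1) skip archives we already downloaded
        resp = requests.get(f"{BASE_URL}/{archive.name}")
        resp.raise_for_status()
        archive.write_bytes(resp.content)  # second step: write the response out
    if not (Path("data") / name).exists():  # (2) skip already-extracted dirs
        with zipfile.ZipFile(archive) as zf:
            zf.extractall()  # assumes members already start with data/

for name in NAMES:
    download_and_extract(name)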
CSCI 503 students: to download the files, use the aiohttp library with asyncio. (This library is installed on tiger, but if you work locally, you may need to install it; conda install aiohttp should work. You may also need to install nest_asyncio via pip.) While you can refer to the example we showed in class, remember that you need to save each file locally after downloading it. This should be handled as a separate async coroutine, but note that the code in that coroutine will be synchronous because operating systems generally do not support asynchronous file I/O. After downloading the archives, you should extract all the zip files into the local directory. This may be done synchronously! Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code to delete those files in order to check that your code works, but be careful that you do not delete other files in the process! (A sketch appears after the hints below.)
Hints: run import nest_asyncio; nest_asyncio.apply() in the notebook; the archives extract into the data directory; use the .exists method with os.path or pathlib.Path to check for path existence.
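A possible shape for the asyncio version, under the same assumptions as above (placeholder URL, archives extracting into data/):

import asyncio
import zipfile
from pathlib import Path

import aiohttp
import nest_asyncio

nest_asyncio.apply()  # lets asyncio.run work inside Jupyter's running loop

BASE_URL = "https://example.edu/a7"  # hypothetical; use the course web site URL
NAMES = ["ab", "c", "df", "gk", "ln", "o", "pr", "s", "tv", "wz"]

async def save_file(path, data):
    # a separate coroutine for saving; its body is synchronous because
    # operating systems generally do not support asynchronous file I/O
    path.write_bytes(data)

async def download(session, name):
    archive = Path(f"{name}.zip")
    if not archive.exists():  # (1) skip archives we already downloaded
        async with session.get(f"{BASE_URL}/{archive.name}") as resp:
            data = await resp.read()
        await save_file(archive, data)
    return archive

async def download_all():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(download(session, n) for n in NAMES))

archives = asyncio.run(download_all())
for archive in archives:
    if not (Path("data") / archive.stem).exists():  # (2) skip extracted dirs
        with zipfile.ZipFile(archive) as zf:
            zf.extractall()  # extraction may be done synchronously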
The zip files are structured differently depending on the counties. Some counties have their data stored in the numpy format (.npy) while others use the CSV format (.csv). In addition, some counties have updated data that we want to use (not the original, older data). In these cases, there is a directory named mod or update in the second level of the zip file (e.g. data/ab/mod) that contains the files we need. For a given letter (e.g. a.npy), if there is a file in a subdirectory of a mod or update directory, use it. If there is not a file in that subdirectory, use the original file. Also ignore files with extensions that are not .npy or .csv. Create a list of all the paths (as pathlib.Path objects) that will need to be processed. Note that some files may reside in subdirectories (or sub-subdirectories). Do not move the files into different directories; keep them where they were originally extracted! (A sketch appears after the hint below.)
Hint: remember to prefer the mod or update versions of files when reading through lists of files.
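One way this selection might be sketched, assuming each data file's stem (the county letter, e.g. a for a.npy) identifies it uniquely:

from pathlib import Path

# every .npy/.csv file under data/, wherever it was extracted
files = [p for p in Path("data").rglob("*") if p.suffix in (".npy", ".csv")]

# record originals first, then let mod/update versions override them
chosen = {}
for p in files:
    if "mod" not in p.parts and "update" not in p.parts:
        chosen[p.stem] = p
for p in files:
    if "mod" in p.parts or "update" in p.parts:
        chosen[p.stem] = p
paths = sorted(chosen.values())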
We will use numpy and pandas or polars for the data processing steps. (Again, numpy, pandas, and polars are installed on tiger, but if you work locally, you may need to install them; conda install numpy pandas polars should work.)
You should write a function with a match statement to handle four cases (the updated and non-updated variants of both the .npy and the .csv files). For each Path p, perform the match using p.parts. You may use guards, but you may not use if statements inside of the match cases!
For any npy file, load it using numpy.load and then convert it to a dataframe. For any csv file, load it using the appropriate method from pandas or polars. For any non-updated file, you will need to convert the value column by multiplying it by 10.
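A sketch of such a function is below. The npy_to_df helper is hypothetical: it assumes the first row of each .npy array holds the header, which you should verify against the actual data.

import numpy as np
import pandas as pd

def npy_to_df(p):
    # hypothetical helper: the assignment says each file has a header row,
    # so this assumes the first row of the array holds the column names
    arr = np.load(p, allow_pickle=True)
    df = pd.DataFrame(arr[1:], columns=arr[0])
    for col in ("value", "number_of_accounts"):
        df[col] = pd.to_numeric(df[col], errors="coerce")  # objects -> numbers
    return df

def load_file(p):
    # dispatch on the path components; the guards separate the four
    # combinations of {.npy, .csv} x {updated, original}
    match p.parts:
        case (*parts, name) if name.endswith(".npy") and ("mod" in parts or "update" in parts):
            df = npy_to_df(p)
        case (*parts, name) if name.endswith(".csv") and ("mod" in parts or "update" in parts):
            df = pd.read_csv(p)
        case (*_, name) if name.endswith(".npy"):
            df = npy_to_df(p)
            df["value"] = df["value"] * 10  # non-updated values must be scaled
        case (*_, name) if name.endswith(".csv"):
            df = pd.read_csv(p)
            df["value"] = df["value"] * 10
    return df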
After reading the files and initial conversion, we will replace -999.0 values with a null value (you can use NaN in pandas or None (displayed as null) in polars) and filter only those records related to electricity. Each file has the columns county_name, year, month, data_class, data_field, unit, value, and number_of_accounts. Note that each file also has a header with these column names. For each file, extract those rows with the data_class electricity, and replace the -999.0 values in the value and number_of_accounts columns with a null value.
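In pandas, the cleaning step might be sketched as follows (polars users would use the corresponding filter and replace methods):

import numpy as np

def clean(df):
    # keep only electricity records, then null out the -999.0 sentinel
    df = df[df["data_class"] == "electricity"].copy()
    cols = ["value", "number_of_accounts"]
    df[cols] = df[cols].replace(-999.0, np.nan)
    return df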
Finally, we are going to process all the files using threads via concurrent.futures. Use a ThreadPoolExecutor to run the function from Part 3 for each matching file from Part 2. Take the results from each run, and concatenate them together.
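A minimal sketch, assuming paths comes from Part 2 and load_file and clean are the Part 3 functions sketched above:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def process(p):
    # run the whole Part 3 pipeline for one file
    return clean(load_file(p))

with ThreadPoolExecutor() as executor:
    frames = list(executor.map(process, paths))
combined = pd.concat(frames, ignore_index=True)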
From the single concatenated dataframe, create dataframes for each year of data from 2021 to 2023, and write three files (2021.csv.gz, 2022.csv.gz, and 2023.csv.gz) which contain only the records from the specified year. The .gz extension means that the file will be compressed using the gzip algorithm to make the output smaller (pandas and polars will do this automatically if you specify that file extension).
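A sketch of the per-year writes in pandas, continuing from the combined dataframe above:

for year in (2021, 2022, 2023):
    # pandas infers gzip compression from the .gz extension;
    # assuming the year column is numeric (convert it first if it is not)
    combined[combined["year"] == year].to_csv(f"{year}.csv.gz", index=False)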
While this does not guarantee correct answers, the following code should produce the corresponding results:
# pandas
df2023 = pd.read_csv("2023.csv.gz")
df2023.query('data_field == "1_nat_consumption"').groupby("county_name").value.mean()

# polars
df2023 = pl.read_csv("2023.csv.gz")
df2023.filter(pl.col('data_field') == "1_nat_consumption").group_by("county_name").agg(pl.col('value').mean()).sort('county_name')[:10]
county_name
Albany 40641.475708
Allegany 3623.710633
Bronx 132914.500000
Broome 51036.161167
Cattaraugus 10227.548833
...
Washington 11067.477917
Wayne 15826.508435
Westchester 129336.629111
Wyoming 3091.988700
Yates 20230.106000
Name: value, Length: 62, dtype: float64