The goal of this assignment is to work with the file system, concurrency, and basic data processing in Python.
You will be doing your work in Python for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.10 or higher for your work, although earlier versions (>= 3.8) should work for this assignment. To use tiger, use the credentials you received. If you work remotely, make sure to download the notebook file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command line as jupyter-lab. You may need to install some packages for this assignment: aiohttp (or requests) and pandas. Use the Navigator application or the command line (conda install pandas requests aiohttp) to install them.
In this assignment, we will examine musical artists with Wikipedia articles that are classified as best-selling. These best-selling artists are divided into tables by the number of claimed sales, and their home country is also noted. Wikipedia also tracks statistics like page views over time. I have downloaded data from these tables as well as page view statistics for some of the artists, and made it available on the course web site in six different zip files. You will download those zip files, extract the files from the archives, load the page view statistics, and construct a data frame with all of the page view statistics. You will use threading to download and process the data.
The assignment is due at 11:59pm on Monday, November 21.
You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a7.ipynb.
Please make sure to follow instructions to receive full credit. You may put the code for each part into one or more cells. Note that CSCI 503 students must use asyncio, which is optional for CSCI 490 students.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
There are six zip files posted on the course web site at https://faculty.cs.niu.edu/~dakoop/cs503-2022fa/a7/ that you will download using Python. Their filenames are structured like <start>_million_to_<end>_million.zip, where <start> and <end> come from the tuples [(75,79), (80,99), (100,119), (120,199), (200,249), (250,999)].
Do not hardcode the filenames but instead use the
information above to construct them programmatically. CSCI 503 students
will use aiohttp for
this task while CSCI 490 students may use the requests library.
(If you have trouble with this part and wish to continue with other
parts of the assignment, download the files manually.)
To download the files, use the requests library. (This library is installed on tiger, but if you work locally, you may need to install it; conda install requests should work.) Note that each download is a two-step process: first "get"-ing the file and then writing the response to a file. Consult the documentation.
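For example, a minimal sketch of this step (the variable names are placeholders, and the existence check anticipates the rerunning issue discussed below) might look like:

```python
import os
import requests

BASE_URL = "https://faculty.cs.niu.edu/~dakoop/cs503-2022fa/a7/"
RANGES = [(75, 79), (80, 99), (100, 119), (120, 199), (200, 249), (250, 999)]

for start, end in RANGES:
    # construct each filename programmatically from the (start, end) tuples
    fname = f"{start}_million_to_{end}_million.zip"
    if os.path.exists(fname):
        continue  # already downloaded; skip on reruns
    # step 1: "get" the file from the course web site
    resp = requests.get(BASE_URL + fname)
    resp.raise_for_status()
    # step 2: write the response content to a local file
    with open(fname, "wb") as f:
        f.write(resp.content)
```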
After downloading the archives, you should extract all six zip files into a single directory artist-data. Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because they will already be there. To check for this situation, write code to check (1) if an archive has already been downloaded before downloading it, and (2) if the extracted directory already exists before extracting it. You may also write code that deletes those files in order to check that your code works, but be careful that you do not delete other files in the process!
Hints:
- Use the exists method with os.path or pathlib.Path to check for path existence.
- The extract_dir (shutil) or path (zipfile)
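A sketch of the extraction step under these hints (assuming the RANGES list from the download sketch above) could be:

```python
import zipfile
from pathlib import Path

extract_dir = Path("artist-data")

# only extract if the directory doesn't already exist, so reruns are safe
if not extract_dir.exists():
    for start, end in RANGES:
        fname = f"{start}_million_to_{end}_million.zip"
        with zipfile.ZipFile(fname) as zf:
            # the path parameter extracts every archive into the same directory
            zf.extractall(path=extract_dir)
```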
parameters will be useful.

To download the files, use the aiohttp library and
asyncio to download all the files. (This library is installed on tiger,
but if you work locally, you may need to install it;
conda install aiohttp
should work. You may also need to
install nest_asyncio
via pip.) While you can refer to the
example
we showed in class, remember that you need to save each
file locally after downloading it. This should be handled as a separate
async coroutine, but note that the code in the method will be
synchronous because operating systems generally do not support
asynchronous file I/O. After downloading the archives, you should
extract all six zip files into a single directory artist-data. This may be done synchronously! Remember the DRY principle here. Also,
once you download and extract the files into the working directory,
rerunning the code to test it may not work as expected because they will
already be there. To check for this situation, write code to check (1)
if an archive has already been downloaded before downloading it, and (2)
if the extracted directory already exists before extracting it. You may
also write code to delete those files once to check that your code
works, but be careful that you do not delete other
files in the process!
Hints:
- Run import nest_asyncio; nest_asyncio.apply() in the notebook.
- Use the exists method with os.path or pathlib.Path to check for path existence.
- The extract_dir (shutil) or path (zipfile) parameters will be useful.
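One possible shape for the asynchronous version (the coroutine names are placeholders; note that saving is its own coroutine, but its body is ordinary synchronous file I/O):

```python
import asyncio
import os

import aiohttp
import nest_asyncio

nest_asyncio.apply()  # allows asyncio.run inside Jupyter's running event loop

BASE_URL = "https://faculty.cs.niu.edu/~dakoop/cs503-2022fa/a7/"
RANGES = [(75, 79), (80, 99), (100, 119), (120, 199), (200, 249), (250, 999)]

async def save(fname, data):
    # a separate coroutine for saving; the body is synchronous because
    # operating systems generally do not support asynchronous file I/O
    with open(fname, "wb") as f:
        f.write(data)

async def download(session, fname):
    if os.path.exists(fname):
        return  # archive already downloaded; skip on reruns
    async with session.get(BASE_URL + fname) as resp:
        data = await resp.read()
    await save(fname, data)

async def download_all():
    fnames = [f"{s}_million_to_{e}_million.zip" for s, e in RANGES]
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download(session, fn) for fn in fnames))

asyncio.run(download_all())
```

The extraction can then be done synchronously, exactly as in the CSCI 490 sketch above.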
Now, write code to find all files in the unzipped directories that
end with the file extension .npy
. Note that each directory
has subdirectories, and those subdirectories have different files. You
should find 32 npy files in the various subdirectories, and your code
should find the paths to those files.
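One way to collect these paths, assuming the archives were extracted into artist-data as above, is pathlib's recursive glob:

```python
from pathlib import Path

# recursively search artist-data's subdirectories for .npy files
npy_paths = sorted(Path("artist-data").rglob("*.npy"))
print(len(npy_paths))  # should report 32
```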
Finally, we are going to process all the files using threads via concurrent.futures.
Use numpy to read the data and then pandas to construct data frames.
(Again, numpy and pandas are installed on tiger, but if you work
locally, you may need to install them;
conda install numpy pandas
should work.) Each
.npy
file is a serialization of a numpy array. These arrays
are structured like:

```
array([[20211101, 20211201, ..., 20220901, 20221001],
       [92144, 100668, ..., 76193, 116814]])
```
Use np.load to load this array from the .npy file. The filename encodes the artist's name in all
lowercase characters with dashes replacing spaces. We want to create a
dataframe with columns for the artist’s name, the month, and the page
views. Thus, for george-strait.npy, we want to create a table like this:
| | Date | Views | Artist |
|---|---|---|---|
| 0 | 20211101 | 92144 | George Strait |
| 1 | 20211201 | 100668 | George Strait |
| … | … | … | … |
| 10 | 20220901 | 76193 | George Strait |
| 11 | 20221001 | 116814 | George Strait |
This means the array's two rows become the Date and Views columns (its transpose has the right shape), and the artist's name, recovered from the filename, is repeated in the Artist column for every row.
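A sketch of a helper for one file (process_file is a placeholder name; it assumes the pathlib paths found above):

```python
import numpy as np
import pandas as pd

def process_file(path):
    # load the 2 x 12 array: row 0 holds the dates, row 1 the page views
    arr = np.load(path)
    # transpose so each (date, views) pair becomes a row, then name the columns
    df = pd.DataFrame(arr.T, columns=["Date", "Views"])
    # recover "George Strait" from "george-strait.npy"
    artist = path.stem.replace("-", " ").title()
    # assign fills the Artist column with the same value for every row
    return df.assign(Artist=artist)
```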
We want to do this for all the npy files. Use a ThreadPoolExecutor to process (read, convert, update) each matching file from Part 2. Take the results from each run, and concatenate them together.
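For example, continuing with the hypothetical process_file helper and the npy_paths list from Part 2:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# read and convert every .npy file in a pool of threads
with ThreadPoolExecutor() as executor:
    frames = list(executor.map(process_file, npy_paths))

# concatenate the per-artist frames into a single dataframe
all_views = pd.concat(frames, ignore_index=True)
```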
From the single concatenated dataframe, create 12 dataframes, one for each month, and write 12 files named for the month (e.g. 20211101.csv.gz) which contain only the records from the specified month. The .gz extension means that the file will be compressed using the gzip algorithm to make the output smaller (pandas will do this automatically if you specify that file extension). Each of the csv files should have 33 lines (one for each of the 32 artists plus one line for the column names).
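One possible sketch, assuming the concatenated dataframe from the previous step is named all_views:

```python
# groupby yields one sub-frame per distinct Date value (one per month);
# pandas gzips the output automatically because of the .gz extension
for month, month_df in all_views.groupby("Date"):
    month_df.to_csv(f"{month}.csv.gz", index=False)
```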
Hints:
- the columns argument in the DataFrame constructor
- the index argument can be useful if you don't want to write the index
- the assign method allows you to assign a single value to a column (e.g. Artist) for every row
- the concat method is useful for combining data frames