The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.
You will be doing your work in a Jupyter notebook for this
assignment. You may choose to work on this assignment on a hosted
environment (e.g. tiger)
or on your own local installation of Jupyter and Python. You should use
Python 3.12 for your work although earlier versions should also work. To
use tiger, use the credentials you received. If you work remotely, make
sure to download the .ipynb file to turn in. If you choose to work
locally, Anaconda or miniforge are
probably the easiest ways to install and manage Python. If you work
locally, you may launch Jupyter Lab either from the Navigator
application or via the command-line as jupyter lab
.
In this assignment, we will be working with data from U.S. Department of Agriculture’s Economic Research Service about weekly food sales in 43 states (others are not available due to how the data is collected) made available during the COVID-19 pandemic. I have downloaded and pre-processed the data here, converting it to a fixed-width file format written using the UTF-16 encoding. There are five fields, whose locations (inclusive) in each line are as follows:
State
: (0-13) the name of the state (contains
spaces)Date
: (16-25) the end of the week of the data
collection in yyyy-mm-dd formatCategory
: (28-54) the food category (contains
spaces)Dollars
: (57-65) the total value of salesLastYear
: (68-76) the total value of sales last
year (can be -999 which indicates no data is available)The indices not included in a range are filled with spaces; for example, characters 14 and 15 are always spaces. The data is located here. You may use the following code to download this file:
from pathlib import Path
from urllib.request import urlretrieve
# download the data if we don't have it locally
= "https://faculty.cs.niu.edu/~dakoop/cs503-2024sp/a4/food-prices-weekly.txt"
url = "food-prices-weekly.txt"
local_fname if not Path(local_fname).exists():
urlretrieve(url, local_fname)
You will read the data from that file, calculate monthly averages, and write a new output file in a similar format.
The assignment is due at 11:59pm on Wednesday, March 6.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a4.ipynb
. You
should not turn in the
food-prices-monthly.txt
file as your notebook should
contain the code to create it.
Please make sure to follow instructions to receive full credit. Use a markdown cell to Label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells. In Parts 3 and 4, CSCI 503 students must compute and output two averages and a percent change while CSCI 490 students need only compute and output one average. Do not use external libraries for reading, parsing, or writing the data files.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
Either download the file and
upload it to Jupyter, or use the provided code above to download the
file. Then, use an iterator to read the file
food-prices-weekly.txt
into a list of dictionaries named
data
(a format similar to what we used in Assignment 3). Remember that the file is
encoded as utf-16! Also, remember that a file object will provide an
iterator if you pass it to the iter
function. Read the
header first (the first line), and then the rest of the file. The header
will serve as the keys for each dictionary while the other lines are
values. To split each line into its column values, use
slicing and remove leading and trailing whitespace. You
do not need to convert values (e.g. to integers) in this step (see Step
3). When you finish, your data should look like:
[
...,'State': 'Wyoming',
{'Date': '2023-05-07',
'Category': 'Vegetables',
'Dollars': '2454255',
'LastYear': '2308865'}
]
encoding
keyword argument to the open
function to help decode the file.iter()
functions obtains an iterator
for an object, and next()
retrieves an item. For example,
next(iter([1,2,3]))
is 1.strip
will be useful to remove whitespace.zip
to create pairs of tuples that can be used to
create a dictionary. dict(zip(['a','b'], [1,2]))
produces
the dictionary {'a': 1, 'b': 2}
.For each data item, create a new key-value pair, Month
,
from the Date
column. Given a date in yyyy-mm-dd
format, the Month
value should be yyyy-mm. Add
these pairs to the existing dictionaries in the list.
CSCI 490 students should complete (a) and CSCI 503 students should complete (b)
Create a new list of dictionaries named monthly_data
that has one entry per month for each state and
category that contains four key-value pairs (State
,
Month
, Category
, DollarsAvg
).
DollarsAvg
should be the average of the
Dollars
values for the given month, state, and category.
You will need to convert the individual values to integers before
computing averages. monthly_data
should look like:
[
...,'Month': '2023-05',
{'State': 'Wyoming',
'Category': 'Vegetables',
'DollarsAvg': 2454255.0,
} ]
Create a new list of dictionaries named monthly_data
that has one entry per month for each state and
category that contains six key-value pairs (State
,
Month
, Category
, DollarsAvg
,
LastYearAvg
, PctChange
).
DollarsAvg
and LastYearAvg
should be the
average of the Dollars
and
LastYear
values, respectively, for the given month, state,
and category. However, remember that -999 indicates
LastYear
is invalid so do not include this
value in your averages. Instead, indicate that the value is invalid by
using the nan value (math.nan
). (Update:
(2024-03-06) If either of DollarsAvg
and
LastYearAvg
have any entries for a month,
you should use the existing values to compute the monthly averages
(excluding the -999 values). However, if there are no entries for a
month, that average must be invalid (-999 or math.nan
).
Then, PctChange
is computed by \[
100 \cdot \frac{\texttt{DollarsAvg} -
\texttt{LastYearAvg}}{\texttt{DollarsAvg}}
\] (Update: (2024-03-6) Again,
PctChange
should be nan
if one of LastYearAvg
or DollarsAvg is invalid.) You will need to convert the individual
values to integers before computing averages. Do this efficiently. You
shouldn’t loop through the data more than twice (once to read, once to
write). At the end, monthly_data
should look like:
[
...,'Month': '2023-05',
{'State': 'Wyoming',
'Category': 'Vegetables',
'DollarsAvg': 2454255.0,
'LastYearAvg': 2308865.0,
'PctChange': 5.923997302643776}
]
DollarsAvg
and
LastYearAvg
at the same time, and create the dictionary for
each monthly_data
entry by accessing all data related to
the (Month, State, Category) tuple at once.Write the new monthly_data
list of dictionaries to a
file named food-prices-monthly.txt
in a the standard utf-8
format. This means writing the header values as the first line. Second,
write the floating point values for DollarsAvg
(and for
CSCI 503, LastYearAvg
) with only 1 digit
after the decimal point. CSCI 503 students should also write
PctChange
with 2 digits after the decimal
point. All the numbers should be right-aligned with at least two spaces
between each column. Write a positive sign (+
) for positive
values of PctChange
and a negative sign for negative
values. The positions of each entry (inclusive) should
be:
State
: (0-13) the name of the state (contains
spaces)Month
: (16-22) the month of the data collection in
yyyy-mm formatCategory
: (25-51) the food category (contains
spaces)DollarsAvg
: (54-64) the average value of sales for the
monthLastYearAvg
: (67-77) the average value of
sales for the month last yearPctChange
: (80-88) the percent change from
the last yearThe output file should look like (CSCI 490 students will not have the last two columns):
State Month Category DollarsAvg LastYearAvg PctChange
Alabama 2019-10 Alcohol 22639721.0 21755804.5 +3.90
Alabama 2019-11 Alcohol 22351135.0 21580366.5 +3.45
Alabama 2019-12 Alcohol 23649327.0 20516532.4 +13.25
... ... ... ... ... ...
Wyoming 2023-03 Vegetables 2354655.5 2225477.8 +5.49
Wyoming 2023-04 Vegetables 2355812.2 2252262.0 +4.40
Wyoming 2023-05 Vegetables 2454255.0 2308865.0 +5.92
print
function calls to write a few data items to
stdout
first, then when you are satisfied the format is
correct, write all of the data items to a file.print
function.with
statement to make sure all data is written
to the file (or make sure to call close
).w
flag to open
to be
able to write to a file.!cat
command:
!cat food-prices-monthly.txt