The goal of this assignment is to work with the data processing and visualization in Python.
You will be doing your work in Python for this assignment. You may
choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local
installation of Jupyter and Python. You should use Python 3.12 for your
work, but other recent versions should also work. To use tiger, use the
credentials you received. If you choose to work locally, Anaconda or miniforge are
probably the easiest ways to install and manage Python. If you work
locally, you may need to install some packages in addition to those from
previous assignments for this assignment: matplotlib and altair. Use the
Navigator application or the command line
conda install matplotlib altair
to install them.
In this assignment, we will be working with data and visualizing it. We will revisit the Utility Energy Registry dataset, but instead of textual summaries, we will create tables and visualizations to gain insight. The data has been pre-processed and is available as a compressed csv file at http://faculty.cs.niu.edu/~dakoop/cs503-2024sp/a8/ny-county-energy.csv.gz. Note that you do not need to download or uncompress the data because pandas can load the url directly as a csv file. This data has the following fields:
county_name
: the name of the countyyear
: the year the data was collectedmonth
: the month the data was collecteddata_class
: electricity or natural_gasdata_field
: a tag for the field measureddata_field_display_name
: the field measured in
human-readable form (e.g. “Total Consumption (T)”)unit
: the unit of measurement (MWh, Therms)value
: the value of the measurementnumber_of_accounts
: the number of accounts included in
the measurementThe assignment is due at 11:59pm on Thursday, May 2.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a8.ipynb
.
Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
First, we will use pandas to compute statistics to find the New York counties with large natural gas usage.
Load the dataset. Compute the county and year+month which used the
most natural gas (the data field display name is
"Total Consumption (T)"
while the data_field
is "6_nat_consumption"
). In your notebook, be clear about
which county, year, and month had the highest natural gas use.
&
operator, but remember to use parentheses.max
.Next, compute the county that used the most natural
gas per account. Again, we use the
Total Consumption (T)
("6_nat_consumption"
)
category, but need to add up both the usage and number of accounts over
all the months. Then, we need to compute a derived column by dividing
the usage by the number of accounts. Do this all
without loops. The answer will be a different county
than in part a. Again, make sure to state the answer in your
notebook.
groupby
and an aggregation
operation to compute the totals.assign
method.Next, let’s use visualization to examine trends in total electricity and natural gas use over time. We will use matplotlib (via pandas) for this.
Using matplotlib directly or via pandas’ plotting routines, create a
visualization with two subplots, one for electricity and one for natural
gas. Each plot should show a bar chart of the state’s
average use (over all years) for each month for the
corresponding energy class. You can extract only the total consumption
by querying for the Total Consumption (T)
data_field_display_name
. (The individual data_fields are
3_nat_consumption
(electricity) and
6_nat_consumption
(natural gas).) If you would like pandas
to plot this automatically, use the subplots=True
option,
but you will need your data to look correct first. Specifically, rows
should be months and columns should be the type of energy (electricity
and natural_gas). Note that you should title the plots according to the
type of energy. You do not need to label the axes, but will receive
extra credit if you.
unstack
may be useful if you set up your data correctly. Note that unstack can
take a parameter that indicates the level of the
indexAgain, using matplotlib directly or via pandas, create a
visualization with two subplots for energy and natural gas, but this
time showing the total per month as line charts. Here,
we need to manipulate the data first to get meaningful dates first. You
can use pandas’ to_datetime
method to accomplish this, but you will need to pass it either a data
frame that has the columns year, month, and day, or a series of string
values that look like dates (e.g. ‘2023-01’). (If you go the first
route, assign
will help add a day column.). Then, similar groupby-aggregation steps
will work as in Part 2a. You should again title the plots according to
the type of energy. You do not need to label the axes, but will receive
extra credit if you do.
title
parameter is useful for labeling subplots in
pandasGiven the interesting patterns between electricity and natural gas use over the seasons, we are interested in seeing how this plays out in different counties by month. We will use altair to explore this.
Using altair, create a set of twelve scatterplots for 2023 that show
electricity versus natural gas use. Use the altair’s facet
capabilities to draw a scatterplot for each month. Add a tooltip that
shows the county name so that moving the mouse over the point shows its
name. Draw the scatterplots in a 4x3 layout.
facet
has an option for the number of columns to
useNow add brushing to the set of twelve scatterplots such that selecting a point corresponding to a particular county in one of the scatterplots highlights that same county in all of the other months. Show this highlight by both changing the color and the size of the point.