The goal of this assignment is to work with the data processing and visualization in Python.
You will be doing your work in Python for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .py files to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab
. You may need to install some packages for this assignment: pandas, matplotlib, and altair will be useful. Use the Navigator application or the command line conda install pandas matplotlib altair
to install them.
In this assignment, we will be working with data and visualizing it. We will revisit the Utility Energy Registry dataset, but instead of textual summaries, we will create tables and visualizations to gain insight. The data has been pre-processed and is available as a compressed csv file at http://faculty.cs.niu.edu/~dakoop/cs503-2021fa/a8/ny-county-energy.csv.gz. Note that you do not need to download or uncompress the data because pandas can load the url directly as a csv file. This data has the following fields:
county_name
: the name of the countyyear
: the year the data was collectedmonth
: the month the data was collecteddata_class
: electricity or natural_gasdata_field_display_name
: the field measured (e.g. “Total Consumption (T)”)unit
: the unit of measurement (MWh, Therms)value
: the value of the measurementnumber_of_accounts
: the number of accounts included in the measurementThe assignment is due at 11:59pm on Friday, December 3.
You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a8.ipynb
.
Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
First, we will use pandas to compute statistics to find the New York counties with large electricity usage.
Load the dataset and filter out any records whose data field is not Total Consumption (T)
as that is the only field we will be using in this assignment. Compute the county and year+month which used the most electricity. For this part, make sure you choose the correct data class. In your notebook, be clear about which county, year, and month had the highest electricity use.
&
operator, but remember to use parentheses.max
.Next, compute the county that used the most electricity per account in 2020. Again, we use the Total Consumption (T)
category, but need to add up both the usage and number of accounts over all twelve months. Then, we need to compute a derived column by dividing the usage by the number of accounts. Do this all without loops. The answer will be a different county than in part a. Again, make sure to state the answer in your notebook.
groupby
and an aggregation operation to compute the totals.assign
method.Next, let’s use visualization to examine trends in electricity and natural gas use over time. We will use matplotlib (via pandas) for this.
Using matplotlib directly or via pandas’ plotting routines, create a visualization with two subplots, one for electricity and one for natural gas. Each plot should show a bar chart of the state’s total use for each year for the corresponding energy class. If you would like pandas to plot this automatically, use the subplots=True
option, but you will need your data to look correct first. Specifically, rows should be years and columns should be the type of energy (electricity and natural_gas). Note that you should title the plots according to the type of energy. You do not need to label the axes, but will receive extra credit if you.
unstack
will be quite useful if you set up your data correctly. Note that unstack can take a parameter that indicates the level of the indexAgain, using matplotlib directly or via pandas, create a visualization with two subplots for energy and natural gas, but this time showing the total per month as line charts. Here, we need to manipulate the data first to get meaningful dates first. You can use pandas’ to_datetime
method to accomplish this, but you will need to pass it either a series of string values that look like dates (e.g. ‘2020-1’) or a data frame that has the columns year, month, and day. (If you go the second route, assign
will help add a day column.). Then, similar groupby-aggregation steps will work as in Part 2a. You should again title the plots according to the type of energy. You do not need to label the axes, but will receive extra credit if you.
title
parameter is useful for labeling subplots in pandasGiven the interesting patterns between electricity and natural gas use over the seasons, we are interested in seeing how this plays out in different counties by month. We will use altair to explore this.
Using altair, create a set of twelve scatterplots for 2020 that show electricity versus natural gas use. Use the altair’s facet
capabilities to draw a scatterplot for each month. Add a tooltip that shows the county name so that moving the mouse over the point shows its name. Draw the scatterplots in a 4x3 layout.
facet
has an option for the number of columns to useNow add brushing to the set of twelve scatterplots such that selecting a point corresponding to a particular county in one of the scatterplots highlights that same county in all of the other months. Show this highlight by both changing the color and the size of the point.