Goals

The goal of this assignment is to work with data processing and visualization in Python.

Instructions

You will be doing your work in Python for this assignment. You may choose to work on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. To use tiger, use the credentials you received. If you work remotely, make sure to download the notebook (.ipynb) file to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python, and you may launch Jupyter Lab either from the Navigator application or from the command line as jupyter-lab.

You may need to install some packages for this assignment: pandas, matplotlib, and altair will be useful. To install them, use the Navigator application or run conda install pandas matplotlib altair on the command line.

In this assignment, we will be working with data and visualizing it. We will revisit the Pokémon dataset, but instead of textual summaries, we will create tables and visualizations to gain insight.

Due Date

The assignment is due at 11:59pm on Thursday, April 22.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a8.ipynb.

Details

Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Maximum Combat Power (25 pts)

We will compute the maximum combat power (MaxCP) using pandas, and find the Pokémon with the greatest MaxCP in each generation. Download and load the pokemon.json file as a pandas data frame; it is unchanged from Assignment 5, so if you still have it, you can use that same file.

a. Compute Combat Power (15 pts)

Again, the maximum combat power can be computed via information here for level 40 as:

Attack = 2 * round(attack^0.5 * sp_attack^0.5 + speed^0.5)
Defense = 2 * round(defense^0.5 * sp_defense^0.5 + speed^0.5)
Stamina = 2 * hp
MaxCP = (Attack + 15) * (Defense + 15)^0.5 * (Stamina + 15)^0.5 * 0.7903001^2 / 10

We wish to create a new column in our data frame named max_cp that stores this computed value. Do this without loops. Remember that pandas can do computations with series (columns).
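
As a minimal sketch of the loop-free style, the snippet below applies the formulas above to a tiny hand-made data frame. The two rows and their stat values are made up for illustration, and the column names are assumed to match the names used in the formulas; the real pokemon.json may differ.

```python
import pandas as pd

# Toy rows with made-up stats; the assignment loads pokemon.json instead.
df = pd.DataFrame({
    "name": ["Alpha", "Beta"],
    "hp": [45, 80],
    "attack": [49, 100],
    "defense": [49, 64],
    "sp_attack": [64, 100],
    "sp_defense": [64, 36],
    "speed": [49, 100],
})

# Each expression works on whole columns (Series) at once, so no loop is needed.
attack = 2 * (df["attack"] ** 0.5 * df["sp_attack"] ** 0.5 + df["speed"] ** 0.5).round()
defense = 2 * (df["defense"] ** 0.5 * df["sp_defense"] ** 0.5 + df["speed"] ** 0.5).round()
stamina = 2 * df["hp"]

# Store the result as a new column.
df["max_cp"] = (
    (attack + 15) * (defense + 15) ** 0.5 * (stamina + 15) ** 0.5 * 0.7903001 ** 2 / 10
)
print(df[["name", "max_cp"]])
```

The intermediate values attack, defense, and stamina are themselves series, so they can be inspected or added as columns if that helps with debugging.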

b. Maximum Combat Power by Generation (10 pts)

Now, find the Pokémon with the greatest maximum combat power in each of the eight generations. Output the name of the Pokémon along with its max_cp value.

Hints:
  • Instead of using a groupby, it may be easier to sort in decreasing order by max_cp, and find a way to locate the first of each generation.
  • drop_duplicates has a subset parameter to only consider duplicates based on a particular column.
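
The two hints can be combined roughly as follows. This is only a sketch on toy data: the names and max_cp values are invented, and the column name generation is an assumption about how the data frame labels that field.

```python
import pandas as pd

# Toy data with hypothetical names and max_cp values.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "generation": [1, 1, 2, 2, 2],
    "max_cp": [1500.0, 2900.0, 3100.0, 800.0, 2200.0],
})

top = (
    df.sort_values("max_cp", ascending=False)  # strongest rows first
      .drop_duplicates(subset="generation")    # keeps the first row seen per generation
      .sort_values("generation")               # restore generation order for display
)
print(top[["generation", "name", "max_cp"]])
```

Because drop_duplicates keeps the first occurrence by default, sorting in decreasing order beforehand means the surviving row for each generation is the one with the greatest max_cp.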

2. Primary Types and Attack, Defense, & Speed (55 pts)

Now, let’s group by the primary type, and examine the average values of each of attack, defense, and speed.

a. Bar Chart (15 pts)

Using matplotlib directly or via pandas’ plotting routines, create a grouped bar chart that shows the mean attack, defense, and speed for each primary type. Set the figure size wider to improve the visualization. Include a legend.

Example Solution for Part 2a
Hints:
  • Here, a groupby will work well, and pandas’ plot routine will treat each column as a separate bar for each group.
  • You can use the width property to control the amount of whitespace between bars.
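
To illustrate these hints, here is a small sketch using pandas' plotting routines on invented data. The column name type1 for the primary type is an assumption, and the figure size and bar width are arbitrary starting points rather than required values.

```python
import pandas as pd

# Toy data; type1 is an assumed column name for the primary type.
df = pd.DataFrame({
    "type1": ["fire", "fire", "water", "water", "grass", "grass"],
    "attack": [80, 100, 60, 70, 50, 70],
    "defense": [60, 70, 80, 100, 70, 90],
    "speed": [90, 70, 60, 50, 40, 60],
})

# One row per primary type, one column per attribute.
means = df.groupby("type1")[["attack", "defense", "speed"]].mean()

# Each column becomes one bar within every group; width controls the spacing.
ax = means.plot.bar(figsize=(12, 4), width=0.8)
ax.set_ylabel("mean value")
ax.legend(title="attribute")
```

In a notebook, the chart is displayed when the cell finishes running.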

b. Scatterplot (15 pts)

From the bar chart, we think there may be a negative correlation between the attack and defense averages and the speed averages. First, compute a data frame that includes the mean attack, defense, and speed for each primary type, and then add the attack and defense means together to create a new column. Plot this new column versus the speed column. Label the axes appropriately. You should see one significant outlier with respect to the negative correlation. Determine the primary type of this outlier.

Hints:
  • You can select multiple columns to be processed after a groupby, and the same aggregation can be applied to all of them.
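
A rough sketch of this approach on toy data is shown below. The values are invented, and the names type1 and attack_defense are assumptions chosen for illustration. Annotating each point with its group label is one possible way to identify an outlier, not the only one.

```python
import pandas as pd

# Toy data; type1 is an assumed column name for the primary type.
df = pd.DataFrame({
    "type1": ["fire", "fire", "water", "water", "grass", "grass"],
    "attack": [80, 100, 60, 70, 50, 70],
    "defense": [60, 70, 80, 100, 70, 90],
    "speed": [90, 70, 60, 50, 40, 60],
})

# Select several columns after the groupby and apply one aggregation to all.
means = df.groupby("type1")[["attack", "defense", "speed"]].mean()

# Column arithmetic creates the combined measure (hypothetical column name).
means["attack_defense"] = means["attack"] + means["defense"]

ax = means.plot.scatter(x="attack_defense", y="speed")
ax.set_xlabel("mean attack + mean defense")
ax.set_ylabel("mean speed")

# Label each point with its primary type so outliers can be read off the plot.
for type_name, row in means.iterrows():
    ax.annotate(type_name, (row["attack_defense"], row["speed"]))
```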

c. Interactive Scatter Matrix (25 pts)

Now, let’s use all three attributes and use altair to create a scatter matrix. A scatter matrix is a grid of scatterplots, one for each pair of variables. Begin by creating a scatterplot that compares just attack and speed. Once you have this, change the x and y values to use the repeat capabilities of altair, and set the repeat over the speed, attack, and defense attributes. Finally, add a brush to the visualization so that we can interactively check relationships between the individual attributes. For example, try creating a rectangular selection over those primary types with high average attack and defense to see where their speeds land.

Example Solution for Part 2c
Hints:

3. Attack-Defense Distribution (30 pts)

a. Bubble Chart (15 pts)

Using altair, create a bubble chart visualization that shows Pokémon positioned by attack and defense, sized by speed, and colored by generation. Make sure you aren’t biasing the visualization toward showing more from the later generations; this happens when later generations are plotted on top of the earlier ones.

Example Solution for Part 3a
Hints:
  • Consider the size and color encodings to create a bubble chart
  • To obtain a different colormap for the generation, consider tagging it with a different type
  • Investigate the sample method of a pandas dataframe to reorder data items

b. Binned Scatterplot (15 pts)

Since Part a led to major overdraw problems (points occluding other points), we will create a binned scatterplot that shows the number of Pokémon in each attack/defense bin. This will help us better see the distribution of Pokémon according to attack and defense measures. Use 1200 bins (40x30) or some other reasonable number that shows the trend.

Example Solution for Part 3b
Hints:
  • You can do aggregations in a field definition
  • Bin transforms are very useful here. They are likely easier than trying to do the binning using pandas.