The goal of this assignment is to work with data and data management tools for spatial, graph, and temporal analysis.
You may choose to work on the second part of the assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work (although Colab's 3.6 should work). To use cloud resources, create/login to a free Google account for Colab. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command line as `jupyter-lab`.
For this assignment, you should install the neo4j database, its python driver, and the geopandas and descartes python libraries. First, download neo4j 4.2. Then, use conda to install the neo4j python driver and the geopandas and descartes libraries (`conda install neo4j-python-driver geopandas descartes`). We will use geopandas for spatial work and neo4j for graph queries.
In this assignment, we will be working with data from the Divvy Bike Share Program in the Chicagoland area. The goal is to load the data and understand the community areas where people use the bikes. We will be using data from July 2020; this is during the COVID-19 pandemic, so any analyses will be affected by the situation at that time. There are three datasets:
The assignment is due at 11:59pm on Thursday, April 22.
You should submit the completed notebook file, named `a5.ipynb`, on Blackboard. Please use Markdown headings to clearly separate your work in the notebook.
CS 680 students are responsible for all parts; CS 490 students do not need to complete Part 4. Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
Use pandas to load the bike trip csv file. Get rid of any data for which the start and end station ids are missing, and convert the `start_station_id` and `end_station_id` columns to ints. Use geopandas to load the community areas GeoJSON file. Rename the 'area_numbe' column to 'area_number' and convert it to an int. Use pandas to load the stations csv file, but convert it to a GeoDataFrame and set its geometry to the point specified by the lon(gitude), lat(itude) pair.
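The cleanup steps above might look like the following sketch. It uses a tiny made-up frame in place of the real csv (which you would load with `pd.read_csv`); the column names match the assignment, but the values are invented:

```python
import pandas as pd

# Hypothetical miniature of the trips table; the real data comes from
# pd.read_csv on the bike trip csv file and has many more columns.
trips = pd.DataFrame({
    'start_station_id': [13.0, None, 42.0],
    'end_station_id': [7.0, 99.0, None],
})

# Drop trips missing either station id, then cast the ids to ints
# (they read in as floats because missing values force a float dtype).
trips = trips.dropna(subset=['start_station_id', 'end_station_id'])
trips['start_station_id'] = trips['start_station_id'].astype(int)
trips['end_station_id'] = trips['end_station_id'].astype(int)
```

For the stations, the analogous step is wrapping the loaded frame in `geopandas.GeoDataFrame(...)` with a geometry built from the lon/lat columns.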
Hint: use the `points_from_xy` method to construct points from longitude and latitude.

We wish to analyze station locations by community area as well as how this impacts trips. We will generate visualizations that show the distribution of stations and the distribution of trips among community areas.
We want to know which community areas each station is in. We can use geopandas' spatial join (`sjoin`) functionality to compute this. Specifically, given the points from the station locations, we want to know which community areas those points are in. After joining these two datasets, you should be able to find the community area number (`area_number`) for each `station_id`.
Hint: set the `crs` of the points to the same as the community areas.

Use the updated station data frame from part a with the bike trip data to add columns that specify the starting and ending community area numbers (`start_ca_num` and `end_ca_num`) for each trip. Use the `start_station_id` and `end_station_id` columns and the result from part a to set these.
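One way the part b lookup might go, sketched with a hypothetical station-to-area mapping standing in for the part a result (the spatial join itself would be something like `gpd.sjoin(stations, areas, predicate='within')`, or `op='within'` in older geopandas; that call is not run here):

```python
import pandas as pd

# Hypothetical station -> community area mapping, as the sjoin in part a
# would produce (real values will differ).
station_cas = pd.DataFrame({
    'station_id': [13, 42, 7],
    'area_number': [32, 28, 8],
})

# Hypothetical trips with cleaned integer station ids.
trips = pd.DataFrame({
    'start_station_id': [13, 42],
    'end_station_id': [7, 13],
})

# One merge per endpoint; rename the joined area column each time.
trips = (trips
         .merge(station_cas, left_on='start_station_id', right_on='station_id')
         .rename(columns={'area_number': 'start_ca_num'})
         .drop(columns='station_id')
         .merge(station_cas, left_on='end_station_id', right_on='station_id')
         .rename(columns={'area_number': 'end_ca_num'})
         .drop(columns='station_id'))
```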
We wish to understand which community areas have bike stations and how that affects trips. Using geopandas, generate two plots: (1) the number of stations per community area, and (2) the number of trips starting or ending in each community area. Both require creating a new data frame by aggregating the stations/trips by community area. Then, use the `plot` command to generate a choropleth map from the GeoDataFrames. This is done by choosing a column to serve as the value in the map and setting a colormap (`cmap`). Note that for (2), a trip starting in the LOOP and ending in the NEAR WEST SIDE will add one to each of those community areas.
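The aggregation for plot (2) can be sketched as follows, using a tiny invented trips frame; the `.plot` call on the merged GeoDataFrame is shown only in a comment since it needs the real geometry:

```python
import pandas as pd

# Hypothetical trips with community area numbers from Part 2b.
trips = pd.DataFrame({
    'start_ca_num': [32, 32, 28],
    'end_ca_num':   [28, 32, 8],
})

# Each trip contributes one count to its starting area and one to its
# ending area, so stack both columns before counting.
counts = (pd.concat([trips['start_ca_num'], trips['end_ca_num']])
          .value_counts()
          .rename('num_trips'))

# To plot: merge counts back into the community areas GeoDataFrame and call
# something like .plot(column='num_trips', cmap='viridis', legend=True).
```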
Hint: you can use `groupby`, but you will need to recreate a GeoDataFrame from the resulting "vanilla" pandas data frame to plot it. The `dissolve` method also works, but is slower because it aggregates geometry (something unnecessary for our work).

We will use a graph database, neo4j, to analyze the community areas likely traversed by people riding between their start and end stations. This requires path-type queries over a graph of community areas connected by edges when they border each other.
First, we need to determine a graph of community areas where an edge indicates that the community areas border each other. To do this, we will compute the spatial join of community areas with themselves. There are a number of operations that can be used for a spatial join, but to make sure that areas with common borders are joined, we will use `'intersects'`. Make sure to get rid of pairs of the same area and deduplicate those that are listed twice. There should be 197 pairs of intersecting (bordering) areas.
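The deduplication step can be sketched as below. The pair frame here is invented; in the assignment it would come from the self `sjoin`, where every border shows up both as (a, b) and (b, a) alongside self-pairs (a, a):

```python
import pandas as pd

# Hypothetical output of the self spatial join: each bordering pair appears
# twice, and every area also intersects itself.
pairs = pd.DataFrame({
    'area_left':  [1, 2, 1, 3, 2],
    'area_right': [2, 1, 1, 2, 3],
})

# Drop self-pairs, then canonicalize each pair as (min, max) so that
# drop_duplicates keeps every border exactly once.
pairs = pairs[pairs['area_left'] != pairs['area_right']]
lo = pairs[['area_left', 'area_right']].min(axis=1)
hi = pairs[['area_left', 'area_right']].max(axis=1)
borders = pd.DataFrame({'a': lo, 'b': hi}).drop_duplicates()
```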
To begin, we will create a new graph database and add a Local DBMS to neo4j. Do this in the neo4j desktop application by creating a new project and creating a Local DBMS associated with that project. You may wish to stop other databases from running, and then start the new DBMS you just created. If you click on the newly created DBMS, you will see the IP address and Bolt port; copy the bolt url. Back in Jupyter, we now wish to connect to this DBMS. Use the following code (potentially with a different url as copied) to connect with the password you created:
```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "chicago"))
```
We also wish to define some helper functions to help us run queries with neo4j (based on functions by CJ Sullivan):
```python
def run_query(query, parameters=None):
    session = None
    response = None
    try:
        session = driver.session()
        response = list(session.run(query, parameters))
    finally:
        if session is not None:
            session.close()
    return response


def batch_query(query, df, batch_size=10000):
    # query can return whatever
    batch = 0
    results = []
    while batch * batch_size < len(df):
        res = run_query(
            query,
            parameters={'rows': df.iloc[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        batch += 1
        results.extend(res)
    info = {"total": len(results), "batches": batch}
    print(info)
    return results
```
From this, we wish to create a graph database that consists of the community areas and connections (relationships) to those community areas they border. Take the original community areas data frame from Part 2, and use it to create a node for each community area. Next, add relationships (`BORDERS`) between community areas that border each other (the 197 relationships from part a). After completing this, you can go to the neo4j Bloom tool and visualize the `CommunityArea - BORDERS - CommunityArea` graph, which should look something like the following:
Next, we wish to use the graph database to compute the paths for the trips from the bike sharing data frame. We will only use those paths that start and end in different community areas because the `shortestPath` function doesn't work with paths starting and ending at the same node. Specifically, we wish to find a shortest path from the community area that the trip starts in to the community area that the trip ends in. (Note that this path is not unique.) From this shortest path, we wish to know the community areas each path goes through. From the results of the query, create a second data frame with the counts of the number of times a community area was part of a trip's path.
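The Cypher for building and querying the graph might look something like the sketch below. The `CommunityArea` label and `BORDERS` type come from the assignment; the property names (`area_number`, `community`) and the `$rows` parameter shape (as sent by the `batch_query` helper) are assumptions. The path query uses a list comprehension over `nodes(p)` for brevity; `reduce` (as in the hints below) works as well:

```cypher
// (1) Create one node per community area (rows supplied via batch_query's $rows)
UNWIND $rows AS row
CREATE (:CommunityArea {area_number: row.area_number, community: row.community})

// (2) Create BORDERS relationships from the 197 deduplicated pairs
UNWIND $rows AS row
MATCH (a:CommunityArea {area_number: row.a}),
      (b:CommunityArea {area_number: row.b})
CREATE (a)-[:BORDERS]->(b)

// (3) Shortest path between distinct start/end areas, ignoring edge direction
UNWIND $rows AS row
MATCH (s:CommunityArea {area_number: row.start_ca_num}),
      (e:CommunityArea {area_number: row.end_ca_num})
MATCH p = shortestPath((s)-[:BORDERS*]-(e))
RETURN row.start_ca_num AS start_ca,
       row.end_ca_num AS end_ca,
       [n IN nodes(p) | n.area_number] AS path_areas
```

Each numbered statement would be run as a separate query; (1) and (2) through `batch_query`, and (3) over the distinct start/end pairs.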
Hint: use the `shortestPath` function, which takes a path expression as its argument.

Hint: `(n1)-[:EDGE_TYPE]-(n2)` is undirected, while `(n1)-[:EDGE_TYPE]->(n2)` is directed. We don't care which way the edge goes when analyzing a path.

Hint: the `nodes` function returns the list of nodes along a path.

Hint: use the `reduce` function to pull out only some of each node's information (e.g. its area number). See this post for some ideas.

Visualize the counts of the community areas being part of a trip. Use the second data frame from part c along with the same geo data frame we used in the visualizations in Part 2.
Next, we wish to analyze when bikes are being used. Our interest is not simply how many trips there are, but how long cyclists keep their bikes. For example, if we want to know how many bikes were used at some point between 9am and 10am, this count would include a bike used from 8am-11am, a bike used from 9:30am-10:20am, and a bike used from 9:15am-9:40am. The count from 10-11am would include the first two bikes again.
We wish to find when the rental interval overlaps with our defined intervals; in our case, this will be every hour. Start by creating an interval array for the rentals from the `starting_at` and `ending_at` columns of the trips data using `IntervalArray.from_arrays`. Note, however, that you will get an error because some of the trips have timestamps out of order; drop those rows for this part of the analysis. Now create an `interval_range` that has hourly intervals from the beginning of July through the end of the month. Compute the number of rental intervals that overlap with each of the hourly intervals. (`overlaps` helps here, but I think you will need to loop through the hourly intervals, computing the overlaps for each one.) Create a new data frame that has the index equal to the beginning of each hour interval, and values equal to the number of overlapping rental intervals. From this data frame, create a line plot that shows the number of rentals in use during each hour. The first ten rows of the table are shown below:
| start_hour | num_rentals_active |
|---|---|
| 2020-07-01 00:00:00 | 141 |
| 2020-07-01 01:00:00 | 176 |
| 2020-07-01 02:00:00 | 132 |
| 2020-07-01 03:00:00 | 70 |
| 2020-07-01 04:00:00 | 83 |
| 2020-07-01 05:00:00 | 157 |
| 2020-07-01 06:00:00 | 467 |
| 2020-07-01 07:00:00 | 770 |
| 2020-07-01 08:00:00 | 855 |
| 2020-07-01 09:00:00 | 739 |
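The interval machinery can be sketched on the three-bike example from the introduction (the timestamps here are those invented examples, not real data; the real intervals come from `starting_at`/`ending_at`):

```python
import pandas as pd

# The three example rentals from the prose: 8-11am, 9:30-10:20am, 9:15-9:40am.
starts = pd.to_datetime(['2020-07-01 08:00', '2020-07-01 09:30', '2020-07-01 09:15'])
ends   = pd.to_datetime(['2020-07-01 11:00', '2020-07-01 10:20', '2020-07-01 09:40'])
rentals = pd.arrays.IntervalArray.from_arrays(starts, ends)

# Hourly intervals covering 9am-12pm (a full month in the assignment).
hours = pd.interval_range(pd.Timestamp('2020-07-01 09:00'),
                          pd.Timestamp('2020-07-01 12:00'),
                          freq=pd.Timedelta(hours=1))

# For each hourly interval, count the rental intervals that overlap it.
active = pd.DataFrame(
    {'num_rentals_active': [rentals.overlaps(h).sum() for h in hours]},
    index=[h.left for h in hours])
```

As in the prose, all three rentals overlap the 9-10am hour and the first two overlap 10-11am.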
Hint: convert `starting_at` and `ending_at` to pandas timestamps.

Hint: using the `plot` method and specifying the correct `kind` parameter will produce a line plot.

Using the final data frame from part a, downsample the data to days instead of hours, summing the total. Plot this downsampled data.
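A minimal sketch of the downsampling, using a few invented hourly rows in place of part a's frame:

```python
import pandas as pd

# Hypothetical hourly counts indexed by the start of each hour.
hourly = pd.DataFrame(
    {'num_rentals_active': [141, 176, 132, 70]},
    index=pd.to_datetime(['2020-07-01 00:00', '2020-07-01 01:00',
                          '2020-07-01 02:00', '2020-07-02 00:00']))

# Downsample to one row per day, summing the hourly counts;
# calling .plot() on the result gives the daily line plot (not run here).
daily = hourly.resample('D').sum()
```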
Hint: `resample` does most of the work.

In a markdown cell, answer the following three questions: