The goal of this assignment is to work with data and data management tools for spatial, graph, and temporal analysis.
You may choose to work on the second part of the assignment on a
hosted environment (e.g. Google Colab) or on your
own local installation of Jupyter and Python. You should use Python 3.8
or higher for your work (although Colab’s 3.6 should work). To use cloud
resources, create/login to a free Google account for Colab. If you choose to
work locally, Anaconda
is the easiest way to install and manage Python. If you work locally,
you may launch Jupyter Lab either from the Navigator application or via
the command-line as `jupyter-lab`.
For this assignment, you should install the neo4j database, its Python driver, and the geopandas and descartes Python libraries. First, download neo4j 5.3. Then, use conda (or mamba/pip) to install the neo4j Python driver and the geopandas and descartes libraries (`conda install neo4j-python-driver geopandas descartes`). We will use geopandas for spatial work and neo4j for graph queries.
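If you want to verify the installation before starting, a quick sanity check is to import the libraries. This is a minimal sketch; it assumes the packages were installed into the environment your notebook runs in:

```python
# Sanity check: these imports should succeed if the installation worked
import neo4j
import geopandas
import descartes  # needed by older geopandas versions for polygon plotting

print("neo4j driver:", neo4j.__version__)
print("geopandas:", geopandas.__version__)
```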
In this assignment, we will be working with data from the Divvy Bike Share Program in the Chicagoland area. The goal is to load the data and understand the community areas where people use the pedal and electric bikes. We will be using data from July 2022. There are three datasets:
The assignment is due at 11:59pm on Monday, May 1.
You should submit the completed notebook file named
a5.ipynb
on Blackboard. Please use Markdown
headings to clearly separate your work in the notebook.
CS 640 students are responsible for all parts; CS 490 students do not need to complete Part 4. Please make sure to follow the instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
Use geopandas to load the community areas GeoJSON file as the
cas
dataframe. Rename the area_numbe
column to
area_number
and convert it to an int. Use the
json
module to load the station_information, and create a
pandas dataframe stations
using the data under the
data -> stations
path. You should find 1421 stations.
Create a geopandas dataframe from the pandas dataframe, and specify the
geometry using the latitude and longitude. Finally, use pandas to load
the bike trip csv file as the trips
data frame. It’s
probably easiest to download the file locally and unzip it, but you can
also load it directly from the web:
```python
import pandas as pd
import requests
import zipfile
import io

r = requests.get('https://divvy-tripdata.s3.amazonaws.com/202207-divvy-tripdata.zip')
zf = zipfile.ZipFile(io.BytesIO(r.content))
trips = pd.read_csv(zf.open('202207-divvy-tripdata.csv'))
```
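The rest of the loading steps might look something like the sketch below. The GeoJSON file name, the station JSON file name, and the `lat`/`lon` column names are assumptions (the latter follow the standard GBFS station feed); adjust them to match your downloads:

```python
import json
import geopandas as gpd
import pandas as pd

# Load the community areas GeoJSON (file name here is an assumption)
cas = gpd.read_file('community_areas.geojson')
cas = cas.rename(columns={'area_numbe': 'area_number'})
cas['area_number'] = cas['area_number'].astype(int)

# Load the station information JSON and pull out the data -> stations records
with open('station_information.json') as f:
    station_info = json.load(f)
stations = pd.DataFrame(station_info['data']['stations'])  # should have 1421 rows

# Turn the stations into a GeoDataFrame using the lon/lat columns
stations = gpd.GeoDataFrame(
    stations,
    geometry=gpd.points_from_xy(stations['lon'], stations['lat']),
    crs='EPSG:4326',
)
```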
For Part 4, it will be useful to parse the dates in the bike trip
data. Convert the start and end points of each trip to geometry columns, `start_pos` and `end_pos` respectively, and create a GeoDataFrame. Note that only one of the
columns can be the current geometry, and you can switch geometries using
the set_geometry
method.
Hints:
- Use the `points_from_xy` method to construct points from longitude and latitude.
- Use `'EPSG:4326'` for the `crs` parameter.
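A rough sketch of this conversion, assuming the raw Divvy column names (`started_at`, `ended_at`, `start_lng`, `start_lat`, `end_lng`, `end_lat`):

```python
import geopandas as gpd
import pandas as pd

# Parse the trip timestamps now so Part 4 can use them directly
trips['started_at'] = pd.to_datetime(trips['started_at'])
trips['ended_at'] = pd.to_datetime(trips['ended_at'])

# Build point geometries for the start and end of each trip
trips['start_pos'] = gpd.points_from_xy(trips['start_lng'], trips['start_lat'], crs='EPSG:4326')
trips['end_pos'] = gpd.points_from_xy(trips['end_lng'], trips['end_lat'], crs='EPSG:4326')

# Only one column can be the active geometry; start with start_pos
trips = gpd.GeoDataFrame(trips, geometry='start_pos')
```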
We wish to analyze station locations by community area as well as how this impacts trips. We will generate visualizations that show the distribution of stations and the distribution of trips among community areas.
We want to know which community areas each trip begins and ends in.
We can use geopandas’ spatial join
(sjoin) functionality to compute this. Specifically, given the points
from the stations, we want to know which community areas those points
are in. After joining these two datasets, you should be able to find the
community area number (area_number
) for each station.
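For example, the join might be set up as follows. This is a sketch: `predicate='within'` requires a reasonably recent geopandas (older versions use `op`), and the `name` column is assumed from the GBFS station data:

```python
import geopandas as gpd

# Spatial join: which community area polygon contains each station point?
stations_ca = gpd.sjoin(stations, cas[['area_number', 'geometry']], predicate='within')

# Every station row now carries the area_number of its community area
stations_ca[['name', 'area_number']].head()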
Now, we wish to relate the start and end locations for each trip to
the community areas. Add columns that specify the starting and ending
community area numbers (start_ca_num
and
end_ca_num
) for each trip. Here, we need to use the
start_pos
and end_pos
columns from Part 1, and
we will do two spatial joins. You can use a spatial
join as before, but make sure the geometry of the GeoDataFrame is set to
the correct point (start_pos
or end_pos
), and
then set the geometry to the other before computing the second spatial
join.
Hints:
- Drop the columns from `cas` that you won't be using to declutter the joined data frame.
- Rename the `area_number` column to `start_ca_num` (`end_ca_num`) after each spatial join.
- Drop the `index_right` column before joining the other endpoint of the trip.
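One possible way to structure the two joins (a sketch; note that the default inner join silently drops trips whose endpoints fall outside every community area):

```python
# Keep only what we need from cas to avoid cluttering the joined frame
ca_lookup = cas[['area_number', 'geometry']]

# First join: starting community area (active geometry is start_pos)
trips = trips.set_geometry('start_pos')
trips = gpd.sjoin(trips, ca_lookup, predicate='within')
trips = trips.rename(columns={'area_number': 'start_ca_num'}).drop(columns='index_right')

# Second join: switch the active geometry to end_pos and join again
trips = trips.set_geometry('end_pos')
trips = gpd.sjoin(trips, ca_lookup, predicate='within')
trips = trips.rename(columns={'area_number': 'end_ca_num'}).drop(columns='index_right')
```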
We wish to understand which community areas have bike stations and how that affects trips. Using geopandas, generate a plot of the number of stations per community area. This requires creating a new data frame
by aggregating the stations by community area. Then, use the
plot
command to generate a choropleth map from the
GeoDataFrame. This is done by choosing a column to serve as the value in
the map and setting a colormap (cmap
).
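A minimal sketch of the aggregation and choropleth, assuming the station-to-community-area join (`stations_ca`) from part a; the colormap choice is only an example:

```python
# Count stations per community area and merge the counts onto the polygons
station_counts = (stations_ca.groupby('area_number').size()
                  .rename('num_stations').reset_index())
cas_counts = cas.merge(station_counts, on='area_number', how='left').fillna({'num_stations': 0})

# Choropleth: shade each community area by its number of stations
cas_counts.plot(column='num_stations', cmap='viridis', legend=True)
```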
We are interested in the number of trips using pedal bikes versus
those using electric bikes (ebikes). Generate a plot showing the percent
difference between the trips by electric or pedal bikes starting or
ending in each community area. Use the filter of
rideable_type == "electric_bike"
to differentiate between
ebikes and pedal bikes. Note that a trip starting in the LOOP and ending
in the NEAR WEST SIDE will add one to both of those
community areas. We need to calculate four aggregations: the first for all starting cas for pedal bikes, the second for all ending cas for pedal bikes, the third for all starting cas for ebikes, and the fourth for all ending cas for ebikes. Then combine the starting and ending counts into a single count for pedal bikes and a single count for ebikes per community area. Finally, compute the percentage difference
between pedal bikes and ebikes per ca ((# pedal bikes - # ebikes)/#
ebikes). Then, merge with the cas, and plot using a
diverging colormap for the second visualization.
Hint: Use the `.add` and `.subtract` methods with `fill_value` to do this.
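Putting the pieces together, the computation might be sketched as follows. Column and variable names follow the earlier parts, and the scaling by 100 is a choice, not a requirement:

```python
# Split the trips by bike type (assumes the Part 2b columns exist)
is_ebike = trips['rideable_type'] == 'electric_bike'
ebike_trips, pedal_trips = trips[is_ebike], trips[~is_ebike]

# Four aggregations: start/end counts for pedal bikes and for ebikes,
# then combine start and end counts with add(fill_value=0)
pedal_counts = (pedal_trips.groupby('start_ca_num').size()
                .add(pedal_trips.groupby('end_ca_num').size(), fill_value=0))
ebike_counts = (ebike_trips.groupby('start_ca_num').size()
                .add(ebike_trips.groupby('end_ca_num').size(), fill_value=0))

# Percent difference: (# pedal - # ebike) / # ebike, per community area
pct_diff = pedal_counts.subtract(ebike_counts, fill_value=0).divide(ebike_counts) * 100
pct_diff = pct_diff.rename('pct_diff').rename_axis('area_number').reset_index()

# Merge with the community areas and plot with a diverging colormap
cas.merge(pct_diff, on='area_number').plot(column='pct_diff', cmap='RdBu', legend=True)
```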
We will use a graph database, neo4j, to analyze the community areas likely traversed by people riding between their start and end stations, separated by whether they traveled by ebike or pedal bike. This requires path-type queries over a graph of community areas connected by edges when they border each other. We will do this once for ebikes and then a second time for pedal bikes.
First, we need to determine a graph of community areas where an edge
indicates that the community areas border each other. To do this, we
will compute the spatial join of community areas with themselves. There
are a number of operations that can be used for a spatial join, but to
make sure that areas with common borders are joined, we will use
'intersects'
. Make sure to get rid of pairs of the same
area and deduplicate those that are listed twice. There should be 197
pairs of intersecting (bordering) areas.
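A sketch of the self-join and deduplication, assuming the `cas` frame from Part 1:

```python
# Self-join the community areas to find bordering pairs
borders = gpd.sjoin(cas[['area_number', 'geometry']],
                    cas[['area_number', 'geometry']],
                    predicate='intersects')[['area_number_left', 'area_number_right']]

# Drop pairs of an area with itself
borders = borders[borders['area_number_left'] != borders['area_number_right']]

# Deduplicate: each border currently appears as both (a, b) and (b, a)
borders = borders[borders['area_number_left'] < borders['area_number_right']]

len(borders)  # should be 197
```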
To begin, we will create a new graph database and add a Local DBMS to neo4j. Do this in the neo4j desktop application by creating a new project and creating a Local DBMS associated with that project. You may wish to stop other databases from running, and then start the new DBMS you just created. If you click on the newly created DBMS, you will see the IP address and Bolt port; copy the bolt url. Back in Jupyter, we now wish to connect to this DBMS. Use the following code (potentially with a different url as copied) to connect with the password you created:
```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("divvy", "divvybikes"))
```
We also wish to define some helper functions to help us run queries with neo4j (based on functions by CJ Sullivan):
```python
def run_query(query, parameters=None):
    # Run a single Cypher query in its own session and return the results as a list
    session = None
    response = None
    try:
        session = driver.session()
        response = list(session.run(query, parameters))
    finally:
        if session is not None:
            session.close()
    return response

def batch_query(query, df, batch_size=10000):
    # Run the query once per batch of data frame rows, passing each batch
    # as the $rows parameter; the query can return whatever it likes
    batch = 0
    results = []
    while batch * batch_size < len(df):
        res = run_query(query,
                        parameters={'rows': df.iloc[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        batch += 1
        results.extend(res)
    info = {"total": len(results), "batches": batch}
    print(info)
    return results
```
From this, we wish to create a graph database that consists of the
community areas and connections (relationships) to those community areas
they border. Take the original community areas data frame from Part 2,
and use it to create a node for each community area. Next, add
relationships (BORDERS
) between community areas that border
each other (the 197 relationships from part a). After completing this,
you can go to the neo4j Bloom tool and visualize the
CommunityArea - BORDERS - CommunityArea
graph which should
look something like the following:
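As a sketch of the loading step itself, the `batch_query` helper can send parameterized Cypher in batches. The `CommunityArea` label and `BORDERS` relationship come from the assignment; the property names and the `community` name column are assumptions:

```python
# Create one CommunityArea node per community area
create_nodes = """
UNWIND $rows AS row
MERGE (ca:CommunityArea {area_number: row.area_number})
SET ca.name = row.community
RETURN count(ca) AS created
"""
# cast to plain Python ints first if the driver complains about numpy types
batch_query(create_nodes, cas[['area_number', 'community']])

# Create a BORDERS relationship for each bordering pair from part a
create_borders = """
UNWIND $rows AS row
MATCH (a:CommunityArea {area_number: row.area_number_left})
MATCH (b:CommunityArea {area_number: row.area_number_right})
MERGE (a)-[:BORDERS]->(b)
RETURN count(*) AS created
"""
batch_query(create_borders, borders)
```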
Next, we wish to use the graph database to compute the paths for the
trips from the bike sharing data frame. We will do this for two filtered
data frames: (1) ebikes, and (2) pedal bikes; create these two data
frames first. We will only use those paths that start and end in
different community areas because the
shortestpath
function doesn’t work with paths starting and
ending at the same node. Specifically, we wish to find a shortest path
from the community area that the trip starts in to the community area
that the trip ends in. (Note that this path is not unique.) From this
shortest path, we wish to know the community areas each path goes
through. From the results of the query, create a data frame with the
counts of the number of times a community area was part of a trip’s
path. Add the count of the trips that started and ended in the same
community to this dataframe. Make sure to do this for both ebikes and
pedal bikes, separately.
Hints:
- Use the `shortestpath` function, which takes a path expression as its argument.
- `(n1)-[:EDGE_TYPE]-(n2)` is undirected while `(n1)-[:EDGE_TYPE]->(n2)` is directed. We don't care which way the edge goes when analyzing a path.
- The `nodes` function returns the list of nodes along that path.
- Use the `reduce` function to pull out only some of each node's information (e.g. its area number). See this post for some ideas.
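A sketch of the path query and counting for ebikes (repeat analogously for pedal bikes); the exact Cypher shape here is one possibility, not the only one, and it assumes the `ebike_trips` frame from the Part 2d sketch:

```python
from collections import Counter

# Shortest path between the start and end community areas of each trip;
# reduce pulls just the area numbers off the nodes along the path
path_query = """
UNWIND $rows AS row
MATCH (s:CommunityArea {area_number: row.start_ca_num}),
      (e:CommunityArea {area_number: row.end_ca_num})
MATCH p = shortestPath((s)-[:BORDERS*]-(e))
RETURN reduce(areas = [], n IN nodes(p) | areas + n.area_number) AS path_areas
"""

# Only trips that start and end in different community areas
ebike_diff = ebike_trips[ebike_trips['start_ca_num'] != ebike_trips['end_ca_num']]
ebike_paths = batch_query(path_query, ebike_diff[['start_ca_num', 'end_ca_num']])

# Count how often each community area appears on a trip's path;
# remember to also add the trips that start and end in the same area
ebike_path_counts = Counter(a for rec in ebike_paths for a in rec['path_areas'])
```

Because many trips share the same start/end pair, you may want to aggregate duplicate pairs first and scale the path counts; that is an optimization, not a requirement.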
Generate a plot showing the percent difference between the trips by
electric or pedal bikes passing through each community area. Use the
data frames from part c along with the same community areas geodataframe
(cas
) we used for the visualizations in Part 2. This should
look similar to Part 2d. (If you want to see the difference, compute
that difference and plot it.)
Next, we wish to analyze when bikes are being used. Our interest is not simply how many trips there are, but how long cyclists keep their bikes. For example, if we want to know how many bikes were used at some point between 9am and 10am, this count would include a bike used from 8am-11am, a bike used from 9:30am-10:20am, and a bike used from 9:15am-9:40am. The count from 10-11am would include the first two bikes again.
We wish to find when the rental interval overlaps with our defined
intervals; in our case, this will be every hour. Start by creating an
interval array for the rentals from the starting_at
and
ending_at
columns from the trips data using IntervalArray.from_arrays
.
Note, however, that you will get an error because some of the trips have
timestamps out of order; drop those rows for this part of the analysis.
Now create an interval_range
that has hourly intervals from the beginning of July through the end of
the month. Compute the number of rental intervals that overlap with each
of the hourly intervals. (overlaps
helps here, but I think you will need to loop through the hourly
intervals, computing the overlaps for each one.) Create a new data frame
that has the index equal to the beginning of each hour interval, and
values equal to the number of overlapping rental intervals. From this
data frame, create a line plot that shows the number of rentals in use
during each hour. The first ten rows of the table are shown below:
| start_hour | num_rentals_active |
|---|---|
| 2022-07-01 00:00:00 | 399 |
| 2022-07-01 01:00:00 | 328 |
| 2022-07-01 02:00:00 | 198 |
| 2022-07-01 03:00:00 | 140 |
| 2022-07-01 04:00:00 | 113 |
| 2022-07-01 05:00:00 | 237 |
| 2022-07-01 06:00:00 | 598 |
| 2022-07-01 07:00:00 | 978 |
| 2022-07-01 08:00:00 | 1173 |
| 2022-07-01 09:00:00 | 1119 |
Hints:
- Convert `starting_at` and `ending_at` to pandas timestamps.
- Use the `plot` method; specifying the correct `kind` parameter will produce a line plot.
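A rough sketch of the overlap counting, assuming the timestamp columns parsed in Part 1 are named `started_at`/`ended_at` (the raw Divvy names; substitute `starting_at`/`ending_at` if that is what your frame uses):

```python
# Drop trips whose timestamps are out of order, then build the rental intervals
valid = trips[trips['started_at'] <= trips['ended_at']]
rentals = pd.arrays.IntervalArray.from_arrays(valid['started_at'], valid['ended_at'])

# Hourly bins covering July 2022
hours = pd.interval_range(start=pd.Timestamp('2022-07-01'),
                          end=pd.Timestamp('2022-08-01'), freq='H')

# For each hour, count the rental intervals that overlap it (slow but simple loop)
counts = [rentals.overlaps(h).sum() for h in hours]

active = pd.DataFrame({'num_rentals_active': counts},
                      index=hours.left).rename_axis('start_hour')
active.plot(kind='line')
```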
Using the final data frame from part a, downsample the data to days instead of hours, summing the total. Plot this downsampled data.
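With the hourly frame from part a indexed by the `start_hour` timestamps, the downsampling is essentially a one-liner (a sketch):

```python
# Downsample the hourly counts to daily totals and plot
daily = active.resample('D').sum()
daily.plot(kind='line')
```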
Hint: `resample` does most of the work.

In a markdown cell, answer the following three questions: