The goal of this assignment is to work with data and data management tools for spatial, graph, and temporal analysis.
You may choose to work on the second part of the assignment on a
hosted environment (e.g. Google Colab) or on your
own local installation of Jupyter and Python. You should use Python 3.8
or higher for your work (although Colab’s 3.6 should work). To use cloud
resources, create/login to a free Google account for Colab. If you choose to
work locally, Anaconda
is the easiest way to install and manage Python. If you work locally,
you may launch Jupyter Lab either from the Navigator application or via
the command line as `jupyter-lab`.
For this assignment, you should install the neo4j database, its
python driver, and the geopandas and descartes python libraries. First,
download neo4j 4.4. Then, use
conda to install the neo4j python driver and the geopandas and descartes
libraries
(`conda install neo4j-python-driver geopandas descartes`). We
will use geopandas for spatial work and neo4j for graph queries.
In this assignment, we will be working with data from the Divvy Bike Share Program in the Chicagoland area. The goal is to load the data and understand the community areas where people use the bikes. We will be using data from July 2021. There are two datasets: the Chicago community areas (a GeoJSON file) and the Divvy bike trips for July 2021 (a CSV file).
The assignment is due at 11:59pm on Friday, April 29.
You should submit the completed notebook file named `a5.ipynb`
on Blackboard. Please use Markdown
headings to clearly separate your work in the notebook.
CS 680 students are responsible for all parts; CS 490 students do not need to complete Part 4. Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
Use geopandas to load the community areas GeoJSON file. Rename the
‘area_numbe’ column to ‘area_number’ and convert it to an int. Use
pandas to load the bike trip csv file. You will not be able to have
pandas load the zip file directly because it contains an extra metadata
directory. Get rid of any data for which the start and end station ids
are missing. Next, extract the `start_station_id`, `start_lng`, and
`start_lat` columns and rename them to `station_id`, `lng`, and `lat`.
Do the same for the `end_*` columns and concatenate the
starts and ends. Then group by the station id and compute the mean
latitude and longitude for each station id. Finally, create a
GeoDataFrame from this for the stations using the longitude and
latitude. We will use this data frame for the station locations. There
should be ~~843~~ 711 stations.
Hints:

- Use geopandas' `points_from_xy` method to construct points from longitude and latitude.
- Use `'EPSG:4326'` for the `crs` parameter.
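A minimal sketch of this pipeline, assuming hypothetical filenames (`community_areas.geojson`, `divvy_trips_202107.csv`) for the downloaded data:

```python
import pandas as pd
import geopandas as gpd

# Load the community areas and fix the truncated column name.
areas = gpd.read_file('community_areas.geojson')  # hypothetical filename
areas = areas.rename(columns={'area_numbe': 'area_number'})
areas['area_number'] = areas['area_number'].astype(int)

# Load the trips and drop rows missing either station id.
trips = pd.read_csv('divvy_trips_202107.csv')  # hypothetical filename
trips = trips.dropna(subset=['start_station_id', 'end_station_id'])

# Stack the start and end coordinates under common column names.
starts = trips[['start_station_id', 'start_lng', 'start_lat']].rename(
    columns={'start_station_id': 'station_id', 'start_lng': 'lng', 'start_lat': 'lat'})
ends = trips[['end_station_id', 'end_lng', 'end_lat']].rename(
    columns={'end_station_id': 'station_id', 'end_lng': 'lng', 'end_lat': 'lat'})

# Average the coordinates per station and build point geometry.
locs = pd.concat([starts, ends]).groupby('station_id')[['lng', 'lat']].mean()
stations = gpd.GeoDataFrame(
    locs,
    geometry=gpd.points_from_xy(locs['lng'], locs['lat']),
    crs='EPSG:4326')
```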
We wish to analyze station locations by community area as well as how this impacts trips. We will generate visualizations that show the distribution of stations and the distribution of trips among community areas.
We want to know which community areas each station is in. We can use
geopandas’ spatial join
(sjoin) functionality to compute this. Specifically, given the points
from the station locations, we want to know which community areas those
points are in. After joining these two datasets, you should be able to
find the community area number (`area_number`) for each `station_id`.
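For instance, a sketch along these lines (using the `areas` and `stations` frames from Part 1; in older geopandas versions the `predicate` keyword is spelled `op`):

```python
import geopandas as gpd

# Each station point is matched to the community area polygon containing it.
stations_ca = gpd.sjoin(
    stations, areas[['area_number', 'geometry']],
    how='inner', predicate='within')

# stations_ca is indexed by station_id and carries an area_number column.
station_to_ca = stations_ca['area_number']
```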
Use the updated station data frame from part a with the bike trip
data to add columns that specify the starting and ending community area
numbers (`start_ca_num` and `end_ca_num`) for each
trip. Use the `start_station_id` and `end_station_id` and the result
from part a to set these.
Hint: use the `map` method to map one series through another.
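A sketch using `map`, assuming the `station_to_ca` series (indexed by `station_id`) from part a:

```python
# Look up each trip's start and end stations in the station -> area mapping.
trips['start_ca_num'] = trips['start_station_id'].map(station_to_ca)
trips['end_ca_num'] = trips['end_station_id'].map(station_to_ca)
```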
We wish to understand which community areas have bike stations and how that affects trips. Using geopandas, generate two plots: (1) the
number of stations per community area, and (2) the number of trips
starting or ending in each community area. Both require creating a new
data frame by aggregating the stations/trips by community area. Then,
use the `plot` command to generate a choropleth map from the
GeoDataFrames. This is done by choosing a column to serve as the value
in the map and setting a colormap (`cmap`). Note that for
(2), a trip starting in the LOOP and ending in the NEAR WEST SIDE will
add one to both of those community areas.
Hints:

- You can use `groupby`, but you will need to recreate a GeoDataFrame from the resulting "vanilla" pandas data frame to plot it.
- The `dissolve` method also works, but is slower because it aggregates geometry (something unnecessary for our work).
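One possible sketch via `groupby`, counting each trip toward both its start and end areas (the colormap and variable names are illustrative choices):

```python
import geopandas as gpd

# (1) Stations per community area.
station_counts = stations_ca.groupby('area_number').size().rename('num_stations')
areas_stations = gpd.GeoDataFrame(
    areas.merge(station_counts, left_on='area_number', right_index=True, how='left'))
areas_stations.plot(column='num_stations', cmap='viridis', legend=True)

# (2) Trips starting or ending in each community area: a trip contributes
# one to its start area and one to its end area.
trip_counts = trips['start_ca_num'].value_counts().add(
    trips['end_ca_num'].value_counts(), fill_value=0).rename('num_trips')
areas_trips = gpd.GeoDataFrame(
    areas.merge(trip_counts, left_on='area_number', right_index=True, how='left'))
areas_trips.plot(column='num_trips', cmap='viridis', legend=True)
```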
We will use a graph database, neo4j, to analyze the community areas likely traversed by people riding between their start and end stations. This requires path-type queries over a graph of community areas connected by edges when they border each other.
First, we need to determine a graph of community areas where an edge
indicates that the community areas border each other. To do this, we
will compute the spatial join of community areas with themselves. There
are a number of operations that can be used for a spatial join, but to
make sure that areas with common borders are joined, we will use
`'intersects'`. Make sure to get rid of pairs of the same
area and deduplicate those that are listed twice. There should be 197
pairs of intersecting (bordering) areas.
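A sketch of the self-join and deduplication (geopandas suffixes the overlapping columns with `_left`/`_right`):

```python
import geopandas as gpd

# Join the areas with themselves; polygons sharing a border intersect.
pairs = gpd.sjoin(areas, areas, how='inner', predicate='intersects')

# Drop self-matches and keep each unordered pair only once.
pairs = pairs[pairs['area_number_left'] < pairs['area_number_right']]
len(pairs)  # expect 197
```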
To begin, we will create a new graph database and add a Local DBMS to neo4j. Do this in the neo4j desktop application by creating a new project and creating a Local DBMS associated with that project. You may wish to stop other databases from running, and then start the new DBMS you just created. If you click on the newly created DBMS, you will see the IP address and Bolt port; copy the bolt url. Back in Jupyter, we now wish to connect to this DBMS. Use the following code (potentially with a different url as copied) to connect with the password you created:
```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "chicago"))
```
We also wish to define some helper functions to help us run queries with neo4j (based on functions by CJ Sullivan):
```python
def run_query(query, parameters=None):
    session = None
    response = None
    try:
        session = driver.session()
        response = list(session.run(query, parameters))
    finally:
        if session is not None:
            session.close()
    return response

def batch_query(query, df, batch_size=10000):
    # query can return whatever
    batch = 0
    results = []
    while batch * batch_size < len(df):
        res = run_query(query, parameters={'rows': df.iloc[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        batch += 1
        results.extend(res)
    info = {"total": len(results), "batches": batch}
    print(info)
    return results
```
From this, we wish to create a graph database that consists of the
community areas and connections (relationships) to those community areas
they border. Take the original community areas data frame from Part 2,
and use it to create a node for each community area. Next, add
relationships (`BORDERS`) between community areas that border
each other (the 197 relationships from part a).
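A sketch using the helper functions above (it assumes the areas frame has `area_number` and `community` columns and that `pairs` holds the 197 bordering pairs; the node property names are a choice, not a requirement):

```python
# Create one CommunityArea node per community area row.
node_query = """
UNWIND $rows AS row
CREATE (:CommunityArea {area_number: row.area_number, name: row.community})
"""
batch_query(node_query, areas[['area_number', 'community']])

# Create a BORDERS relationship for each bordering pair.
borders_query = """
UNWIND $rows AS row
MATCH (a:CommunityArea {area_number: row.area_number_left})
MATCH (b:CommunityArea {area_number: row.area_number_right})
CREATE (a)-[:BORDERS]->(b)
"""
batch_query(borders_query, pairs[['area_number_left', 'area_number_right']])
```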
After completing this, you can go to the neo4j Bloom tool and visualize
the `CommunityArea - BORDERS - CommunityArea` graph, which should look
something like the following:
Next, we wish to use the graph database to compute the paths for the
trips from the bike sharing data frame. We will only use those paths
that start and end in different community areas because
the `shortestpath` function doesn't work with paths starting
and ending at the same node. Specifically, we wish to find a shortest
path from the community area that the trip starts in to the community
area that the trip ends in. (Note that this path is not unique.) From
this shortest path, we wish to know the community areas each path goes
through. From the results of the query, create a second data frame with
the counts of the number of times a community area was part of a trip’s
path.
Hints:

- Use the `shortestpath` function, which takes a path expression as its argument.
- `(n1)-[:EDGE_TYPE]-(n2)` is undirected while `(n1)-[:EDGE_TYPE]->(n2)` is directed. We don't care which way the edge goes when analyzing a path.
- Given a named path (e.g. `p = shortestpath(...)`), the `nodes` function returns the list of nodes along that path.
- Use the `reduce` function to pull out only some of each node's information (e.g. its area number). See this
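A sketch following these hints (the `trips`, `start_ca_num`, and `end_ca_num` names come from Part 2; deduplicating the pairs before querying, as below, is an optional optimization):

```python
from collections import Counter
import pandas as pd

# One row per distinct (start, end) pair of community areas, with the
# number of trips that take it (pairs with start == end are excluded).
pair_counts = (trips[trips['start_ca_num'] != trips['end_ca_num']]
               .groupby(['start_ca_num', 'end_ca_num']).size()
               .rename('n').reset_index())

path_query = """
UNWIND $rows AS row
MATCH (a:CommunityArea {area_number: row.start_ca_num})
MATCH (b:CommunityArea {area_number: row.end_ca_num})
MATCH p = shortestpath((a)-[:BORDERS*]-(b))
RETURN row.n AS n,
       reduce(acc = [], node IN nodes(p) | acc + node.area_number) AS areas
"""
records = batch_query(path_query, pair_counts)

# Tally how many trip paths pass through each community area.
counts = Counter()
for rec in records:
    for area in rec['areas']:
        counts[area] += rec['n']
path_counts = pd.DataFrame({'area_number': list(counts),
                            'num_paths': list(counts.values())})
```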
post for some ideas.

Visualize the counts of the community areas being part of a trip. Use the second data frame from part c along with the same geo data frame we used in the visualizations in Part 2.
Next, we wish to analyze when bikes are being used. Our interest is not simply in how many trips there are, but in how long cyclists keep their bikes. For example, if we want to know how many bikes were used at some point between 9am and 10am, this count would include a bike used from 8am-11am, a bike used from 9:30am-10:20am, and a bike used from 9:15am-9:40am. The count from 10-11am would include the first two bikes again.
We wish to find when the rental interval overlaps with our defined
intervals; in our case, this will be every hour. Start by creating an
interval array for the rentals from the `starting_at` and `ending_at`
columns from the trips data using `IntervalArray.from_arrays`.
Note, however, that you will get an error because some of the trips have
timestamps out of order; drop those rows for this part of the analysis.
Now create an `interval_range`
that has hourly intervals from the beginning of July through the end of
the month. Compute the number of rental intervals that overlap with each
of the hourly intervals. (`overlaps`
helps here, but I think you will need to loop through the hourly
intervals, computing the overlaps for each one.) Create a new data frame
that has the index equal to the beginning of each hour interval, and
values equal to the number of overlapping rental intervals. From this
data frame, create a line plot that shows the number of rentals in use
during each hour. The first ten rows of the table are shown below
(updated):
| start_hour | num_rentals_active |
|---|---|
| 2021-07-01 00:00:00 | 261 |
| 2021-07-01 01:00:00 | 206 |
| 2021-07-01 02:00:00 | 122 |
| 2021-07-01 03:00:00 | 66 |
| 2021-07-01 04:00:00 | 66 |
| 2021-07-01 05:00:00 | 234 |
| 2021-07-01 06:00:00 | 598 |
| 2021-07-01 07:00:00 | 1063 |
| 2021-07-01 08:00:00 | 1258 |
| 2021-07-01 09:00:00 | 958 |
Hints:

- You will need to convert `starting_at` and `ending_at` to pandas timestamps.
- The `plot` method, specifying the correct `kind` parameter, will produce a line plot.
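A sketch, assuming `trips2` (a hypothetical name) is the trips data with parsed timestamps and the out-of-order rows dropped:

```python
import pandas as pd

# Build one interval per rental.
rentals = pd.arrays.IntervalArray.from_arrays(
    trips2['starting_at'], trips2['ending_at'])

# Hourly bins covering July 2021.
hours = pd.interval_range(start=pd.Timestamp('2021-07-01'),
                          end=pd.Timestamp('2021-08-01'), freq='H')

# For each hour, count the rentals whose interval overlaps it.
active = pd.DataFrame(
    {'num_rentals_active': [rentals.overlaps(h).sum() for h in hours]},
    index=pd.Index([h.left for h in hours], name='start_hour'))

active.plot(kind='line')
```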
Using the final data frame from part a, downsample the data to days instead of hours, summing the total. Plot this downsampled data.

Hint: `resample` does most of the work.
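For example, continuing the sketch above:

```python
# Sum the hourly counts into daily totals and plot them.
daily = active.resample('D').sum()
daily.plot(kind='line')
```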
In a markdown cell, answer the following three questions:
does most of the work.In a markdown cell, answer the following three questions: