The goal of this assignment is to work with data and data management tools for spatial, graph, and temporal analysis.
You may choose to work on this assignment on a hosted environment
(e.g. Google Colab) or
on your own local installation of Jupyter and Python. You should use
Python 3.12 or higher for your work. To install Python, consider using
uv, miniforge, or Anaconda. Also, install JupyterLab (or Jupyter
Notebook). For this assignment, you will need to make sure the polars,
geopandas, shapely, and neo4j-python-driver packages are installed. If
you have trouble loading parquet files, install pyarrow
(conda install pyarrow). In addition, download neo4j.
In this assignment, we will be working with data from the Federal Aviation Administration and from its ADB-B system. The goal is to load the data and understand the air traffic control regions and there relationship to airports and flight paths. We will we using data from November 17, 2025, gathered and made available by adsb.lol. This data has been significantly processed from its raw form. There are three datasets:
The assignment is due at 11:59pm on Friday, December 5.
You should submit the completed notebook file named
a5.ipynb on Blackboard. Please use Markdown
headings to clearly separate your work in the notebook.
CS 640 students are responsible for all parts; CS 490 students do not need to complete Part 4. Please make sure to follow instructions to receive full credit. Use a markdown cell to Label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
Use geopandas to load the ARTCC boundaries GeoJSON file and the airport locations parquet file. These files already identify their associated geometry; for the boundaries, this is a polygon, and for the airports, this is a point. Also, use polars (or pandas) to load the flight traces parquet file. This file contains entires with the icao (a unique 24-bit identifier assigned to an aircraft), flight_id (an identifier that distinguishes different flights by the same aircraft), datetime, lat, lon, altitude, speed, and information about the aircraft.
The flights data is composed of coordinates but this does not tell us which the airports that the flights start and end at, nor the air route traffic control centers they pass through. We will add this information so we can analyze traffic at airports and through different traffic control areas. We will also generate visualizations of this data.
We wish to relate the start and end locations for each flight to the
airports. Create a new dataframe that has icao,
flight_id, start_dt, end_dt,
origin_apt, and dest_apt columns. The
start_dt and end_dt columns should contain the
starting and ending timestamps for each flight, while the
origin_apt and dest_apt columns should contain
the nearest airport to the starting and ending positions of each flight,
respectively. It will be easiest to first compute the raw data
corresponding to the start and end of the flight before worrying about
the spatial join. In other words, create a dataframe that identifies the
start_dt, start_lat, start_lon,
end_dt, end_lat, and end_lon for
each flight, in addition to the icao and
flight_id columns.
Then, create a geopandas GeoDataFrame with the starting and ending
positions as points. Whenever you create a geopandas GeoDataFrame from
latitude and longitude, make sure to set the coordinate reference system
(crs) to the string “EPSG:4326”. Next, you will need to compute a
spatial join to find the nearest airports. You will need to do
two spatial joins, one for the origin and one for the
destination. Note that you must first identify the starting and ending
positions for each flight. Then, use the geopandas
sjoin_nearest function to find the nearest airport to each
of the positions.
arg_min and arg_max,
respectively. These also work inside of aggregations.In order to analyze the paths of the flights, we need to convert the
lat/lon sequences for each flight into a linestring geometry (a sequence
of line segments). You can do this by grouping the data by
icao and flight_id and then aggregating the
lat and lon columns into a list of
coordinates. Then, you can use shapely’s LineString
constructor to convert the list of coordinates into a linestring. Create
a new GeoDataFrame that has the icao,
flight_id, and this new geometry. Join this with your
dataframe from Part a so that you have a single GeoDataFrame with the
icao, flight_id, start_dt,
end_dt, origin_apt, dest_apt, and
geometry columns.
concat_list
function may be very useful.apply method in coordination with an assignment
to the geometry column.Now perform a spatial join between the flight paths and the ARTCC
boundaries to determine which traffic control areas each flight passes
through. You can use geopandas’ sjoin function with the
'intersects' operation to compute this. After performing
the spatial join, create a new dataframe that has the icao,
flight_id, and a list of the unique name
values for the ATC areas that each flight passes through. The following
table shows information for ten random flights. Note that the
artcc_name column contains a list of ATC area names; it is
displayed as a list for succinctness, not because this is the format
your dataframe should be in.
| icao | flight_id | origin_apt | dest_apt | artcc_name |
|---|---|---|---|---|
| str | u32 | str | str | list[str] |
| “abd19e” | 1 | “ATL” | “MCO” | [“ZMA”, “ZJX”, “ZTL”, “ZDC”] |
| “ad000d” | 4 | “RDU” | “ATL” | [“ZTL”, “ZDC”] |
| “a613e7” | 2 | “AUS” | “DEN” | [“ZHU”, “ZAB”, “ZFW”, “ZDV”] |
| “a3e542” | 2 | “PHX” | “ABQ” | [“ZAB”] |
| “ad7282” | 3 | “MCI” | “EWR” | [“ZNY”, “ZOB”, “ZAU”, “ZKC”] |
| “a752ea” | 3 | “ATL” | “BOS” | [“ZNY”, “ZTL”, “ZDC”, “ZBW”] |
| “a5449a” | 1 | “DFW” | “SAN” | [“ZAB”, “ZFW”, “ZLA”] |
| “ac2b80” | 2 | “PDX” | “PHX” | [“ZAB”, “ZLA”, “ZLC”, “ZSE”] |
| “a38ad5” | 4 | “DCA” | “MSY” | [“ZMA”, “ZJX”, “ZDC”, “ZHU”] |
| “a2615e” | 1 | “PDX” | “SFO” | [“ZOA”, “ZSE”] |
Use geopandas to plot the locations of the airports. You can use the
plot command on a GeoDataFrame for this.
Now, use the atc traffic data from Part 2c to generate a plot showing
the number of flights passing through each ATC area. This requires
creating a new data frame by aggregating the flights by ATC area. Then,
use the plot command to generate a choropleth map from the
GeoDataFrame. This is done by choosing a column to serve as the value in
the map and setting a colormap (cmap).
One of the interesting aspects of air traffic control is how flights are handed off between different control centers as they move through the airspace. In Part 2c, we saw the ATCs that each flight passed through, but we did not check if this was optimal. For example, if a flight passed through ATC areas A, B, and C, but it could have gone directly from A to C (skipping B), then we would like to identify this. While our current flights data frame contains the origin and destination airports as well as the ARTCCs that are intersected, we can map these to the origin and destination ATC areas. We will compute the shortest path between these two areas and see which ATC areas are on that path. To do this, we will use a graph database, neo4j, to analyze the paths between ATC areas. This requires path-type queries over a graph of ATC regions connected by edges when they border each other. Note that this is not necessarily optimal routing as going far out of the way to avoid handoffs is not desirable.
First, we need to determine a graph of ATC areas where an edge
indicates that the ATC areas border each other. To do this, we will
compute the spatial join of ATC areas with themselves. There are a
number of operations that can be used for a spatial join, but to make
sure that areas with common borders are joined, we will use
'intersects'. Make sure to get rid of pairs of the same
area and deduplicate those that are listed twice.
To begin, we will create a new graph database and add a Local DBMS to neo4j. Do this in the neo4j desktop application by creating a new project and creating a Local DBMS associated with that project. You may wish to stop other databases from running, and then start the new DBMS you just created. If you click on the newly created DBMS, you will see the IP address and Bolt port; copy the bolt url. Back in Jupyter, we now wish to connect to this DBMS. Use the following code (potentially with a different url as copied) to connect with the password you created:
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))Updated 2025-11-26 Since we have less than 100 pairs
of bordering ATC areas, we can insert them all at once. Construct a
Cypher query to insert the nodes and relationships into the database.
The UNWIND
clause will take a list of 2-tuples/lists and can create the nodes and
relationships.
query = """
UNWIND $batch AS edge
// add origin node
// add destination node
// add borders relationship between them
"""
with driver.session() as session:
session.run(query, batch=edges)Once you have created this correctly, you can visualize the graph using the Bloom tool that comes with neo4j Desktop. You can do this using the “Explore” tool.
MERGE
clause in Cypher will create a node or relationship only if it does not
already exist. The CREATE
clause will always create a new node or relationship.Next, we wish to use the graph database to compute the potential handoffs from the flights data. We need to compute the unique starting and ending ATC areas over all of the flights, and then we can compute the shortest path between the areas. For each flight, we wish to find the shortest path between the starting and ending ATC areas. (Note that this path is not unique.) Make sure to exclude the cases where the starting and ending ATC areas are the same, as these do not require any handoffs. From the results of the query, create a dataframe with the counts of the number of times an ATC was part of a flight’s path. Add the count of the flights that started and ended in the same ATC to this dataframe. Finally, take the flights dataframe from Part 2c and combine it with this handoffs information to get the total number of flights that would pass through each ATC area using shortest paths.
SHORTEST
keyword to compute the shortest path for a path expression(n1)-[:EDGE_TYPE]-(n2) is
undirected while
(n1)-[:EDGE_TYPE]->(n2) is directed. We
don’t care which way the edge goes when analyzing a path.=nodes function
returns the list of nodes along that path.reduce function to pull out only some
of each node’s information (e.g. its area number). See this
post for some ideas.Generate a plot showing the number of flights that would pass through each ATC based on the paths computed in Part c. This plot should look similar to Part 2b. What has changed?
Next, we wish to analyze the temporal patterns of flight activity. We can count the number of flights that are in the air at each minute of the day. Note that the data is in UTC time so you do not need to do any timezone conversions.
You can compute the number of active flights by associating the values +1 and -1 with the start and end of each flight, sorting all of these events by time, and then calculating the cumulative sum of these values over time. However, these values will not be binned into regular time intervals. Therefore, you will need to resample the data into regular intervals (try 10-minute and 20-minute intervals). Create plots of this interval data to see any patterns of activity. Where is there a dip in activity? Why might this be? (remember the data is in UTC!) Why is the activity so low at the very beginning and end of the day; is this real or an artifact of the data we are examining?
cum_sum
method to compute cumulative sums.join_asof
method. Make sure to fill in any missing values with the previous
value.