Please note that this schedule is subject to change.
Lectures
(09/03) Relational Databases
(09/08) Relational Databases
(09/17) Polars and DuckDB
(09/29) Data Transformation
- Slides
- Reading:
- References:
- Foofah: Transforming Data By Example, Z. Jin et al., 2017.
- Spreadsheet Table Transformations from Examples, W. R. Harris and S. Gulwani, 2011.
- Wrex: A Unified Programming-by-Example Interaction for Synthesizing Readable Code for Data Scientists
- Spreadsheet Data Manipulation Using Examples, S. Gulwani et al., 2012.
- Tidy Data, H. Wickham, 2014.
- Component-Based Synthesis of Table Consolidation and Transformation Tasks from Examples, Y. Feng et al., 2017.
- AutoTables: Relationalize Tables without Examples, P. Li et al., 2023.
- Relationalizing Tables with Large Language Models: The Promise and Challenges, Z. Huang and E. Wu, 2024.
(10/01) Data Cleaning
- Slides
- Reading:
- References:
- Courselets (Python):
(10/13) Data Fusion
- Slides
- Reading:
- References:
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, M. Armbrust et al., 2021.
- Apache Iceberg: An Architectural Look Under the Covers, A. Merced, 2022.
- Entity Resolution, L. Getoor and A. Machanavajjhala, 2012.
- Record Linkage, P. Christen, 2019.
- Data
Fusion—Resolving Data Conflicts in Integration, X. L. Dong and F.
Naumann
- Data Integration and Machine Learning: A Natural Synergy, X. L. Dong and T. Rekatsinas, 2018.
(10/15) Data Fusion
- Slides
- Reading:
- References:
- Courselets (Python):
(10/20) Scalable Databases
(10/22) Scalable Databases
- Slides
- Reading:
- References:
- NewSQL, A. Pavlo, 2012.
- The Official Ten-Year Retrospective of NewSQL, A. Pavlo, 2021.
- Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012.
- F1: A Distributed SQL Database That Scales, J. Shute et al., 2013.
- Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017.
- A Critique of the CAP Theorem, M. Kleppmann, 2015.
- Is Scalable OLTP in the Cloud a Solved Problem?, T. Ziegler et al., 2022.
(10/27) Scalable Dataframes
- Slides
- Reading:
- References:
- Scaling up your pandas workflows with Modin, D. Petersohn, 2022.
- Spark SQL: Relational Data Processing in Spark, M. Armbrust et al., 2015.
- Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System, D. Petersohn et al., 2022.
- Blazing fast dataframes in Python with Polars, J. L. C. Rodríguez, 2022.
- Polars at Scale, R. Vink, 2025.
- Ibis Overview
- Polars
- Modin
- Vaex
- Dask
(10/29) Scalable Dataframes
(11/05) Graph Data
- Slides
- Reading:
- References:
- Introduction to Neo4j and Graph Databases, M. D. Allen, 2019.
- Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries (preprint), M. Besta et al., 2022.
- Survey of Graph Database Models, R. Angles and C. Gutierrez, 2008.
- An Introduction to Graph Data Management, R. Angles and C. Gutierrez, 2017.
- Graph Databases, D. Lembo and R. Rosati, 2015.
- Introduction to Graph Databases, M. De Marzi, 2012.
- The Future Is Big Graphs: A Community View on Graph Processing Systems, S. Sakr et al., 2021.
- The (sorry) State of Graph Database Systems, P. Boncz, 2022.
- A Roadmap to Graph Analytics, A. Bonifati et al., 2025.
(11/12) Databases and Visualization
(11/17) Spatial Data
- Slides
- Reading:
- References:
- Big Spatial Data Management, A. Eldawy, 2020.
- Data Cubes, J. Han, M. Kamber, and J. Pei, 2011.
- Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, L. Lins et al., 2013.
- TopKube: A Rank-Aware Data Cube for Real-Time Exploration of Spatiotemporal Datasets, F. Miranda et al., 2017.
- Dynamic prefetching of data tiles for interactive visualization, L. Battle et al., 2016.
(11/24) Provenance
- Slides
- Reading:
- References:
- Provenance for Computational Tasks: A Survey, J. Freire et al., 2008.
- Provenance in Databases: Why, How, and Where, J. Cheney et al., 2007.
- Provenance in Databases, A. Amarilli, 2019.
- Capturing and querying fine-grained provenance of preprocessing pipelines in data science, P. Missier, 2023.
- Geopandas Example: Download, View
(12/01) Reproducibility
- Slides
- Reading:
- References:
- Repeatability and Benefaction in Computer Systems Research, C. Collberg et al., 2015.
- Reproducible Research in Computational Science, R. D. Peng, 2011.
- Ten Simple Rules for Reproducible Computational Research, G. K. Sandve et al., 2013.
- Computational Reproducibility: State-of-the-Art, Challenges, and Database Research Opportunities, J. Freire et al., 2012.
- A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks, J. F. Pimentel, 2019.
(12/03) Databases and Machine Learning