Final Exam

Date, Time, & Location

Wednesday, May 8, 8:00-9:50am, PM 252

The final exam is comprehensive, covering all material from the beginning of the semester through the end, with some emphasis on material covered since Test 2. It will draw on the assigned readings and the topics we discussed in class.

Format

  • Multiple Choice (20-25 questions)
  • Free Response (5-6 questions)
  • CSCI 640 students will answer additional questions

Topics

  • Python
  • numpy
  • pandas
  • Data (items, attributes, attribute types, semantics, metadata)
  • Data Wrangling
  • Data Cleaning
  • Data Transformation
  • Data Integration
  • Data Fusion
  • Scalable Databases
  • Scalable Dataframes
  • Time Series Data
  • Graph Data
  • Databases and Visualization
  • Spatial Data
  • Data Curation
  • Provenance (Computational, Database, Evolution)
  • Reproducibility
  • Databases and Machine Learning

Readings

Assigned Readings

Referenced Papers

Free Response Example Questions

  • Examples from Test 1
  • Examples from Test 2
  • For which types of queries would we expect a graph database to provide better performance than a relational database?
  • What did Sanu et al.’s survey about graph datasets find about their sizes? What are the problems related to graph databases that Boncz discusses?
  • What is unique about RDF triple stores compared to other graph databases with respect to schema and instance?
  • Why does imMens focus on being extremely efficient in calculating visualization updates?
  • How does Mosaic produce visualizations more efficiently? (Hint: think about the number of pixels.)
  • In addition to doing pre-computation, how does ForeCache reduce interaction latency?
  • What challenges does Beast tackle with respect to spatial data processing? How does its architecture relate to other approaches?
  • What is provenance, and what is required to capture, store, and use provenance?
  • What is the difference between prospective and retrospective provenance?
  • What are the trade-offs between workflow- and OS-based provenance capture?
  • What questions can database provenance answer? What are the differences between “Why”, “How”, and “Where” provenance?
  • What was evolution provenance in VisTrails used for?
  • What are concerns involved in reproducing a previous computational study?
  • What types of analyses can be done to evaluate how reproducible published work is?
  • How might machine learning impact databases?
  • Which type of engine (OLAP or OLTP) is SageDB being developed for? Which components of a database does SageDB present machine learning approaches for? How do they perform versus standard databases?