Date, Time, & Location
Tuesday, May 5, 4:00pm-5:50pm, Online
Overview
The final exam is comprehensive: it covers material from the entire semester, with extra emphasis on material covered since Test 2. Questions will draw on the assigned readings and the topics we discussed in class.
Topics
- Python
- NumPy
- pandas
- Data (items, attributes, attribute types, semantics, metadata)
- Data Wrangling
- Data Cleaning
- Data Transformation
- Data Integration
- Data Fusion
- Data Exploration
- Dataset Search
- Scalable Databases
- Data Curation
- Graph Data
- Time Series Data
- Spatial Data
- Provenance (Computational, Database, Evolution)
- Reproducibility
- Databases and Machine Learning
Readings
Assigned Readings
- Python for Data Analysis (2nd ed.), W. McKinney, Chs. 1-8, 11
- Wrangler: Interactive Visual Specification of Data Transformation Scripts, S. Kandel et al., 2011
- Foofah: Transforming Data By Example, Z. Jin et al., 2017
- Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009
- Dataset search: a survey, A. Chapman et al., 2020
- Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009
- Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012
- The FAIR Guiding Principles for scientific data management and stewardship, M. D. Wilkinson et al., 2016
- A Comparison of Current Graph Database Models, R. Angles, 2012
- Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, L. Lins et al., 2013
- Provenance for Computational Tasks: A Survey, J. Freire et al., 2008
- Reproducibility Using VisTrails, J. Freire et al., 2013
- Repeatability in computer systems research, C. Collberg and T. A. Proebsting, 2016
- SageDB: A Learned Database System, T. Kraska et al., 2019
Referenced Papers
- Potter’s Wheel: An interactive data cleaning system, V. Raman and J. M. Hellerstein, 2001
- Self-Service Data Preparation: Research to Practice, J. M. Hellerstein et al., 2018
- Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations, Y. He et al., 2018
- Goods: Organizing Google’s Datasets, A. Halevy et al., 2016
- Google Dataset Search: Building a search engine for datasets in an open Web ecosystem, N. Noy et al., 2019
- F1: A Distributed SQL Database That Scales, J. Shute et al., 2013
- Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017
- An Introduction to Graph Data Management, R. Angles and C. Gutierrez, 2017
- TopKube: A Rank-Aware Data Cube for Real-Time Exploration of Spatiotemporal Datasets, F. Miranda et al., 2017
- Dynamic prefetching of data tiles for interactive visualization, L. Battle et al., 2016
- Provenance in Databases: Why, How, and Where, J. Cheney et al., 2007
- A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks, J. F. Pimentel et al., 2019
Free Response Example Questions
- Examples from Test 1
- Examples from Test 2
- What feature of many spatial datasets do Nanocubes and TopKube take advantage of to reduce the size of the stored data?
- What is the main difference between the types of analyses that Nanocubes and TopKube support? How does that change their implementations?
- In addition to doing pre-computation, how does ForeCache reduce interaction latency?
- What is provenance, and what is required to capture, store, and use provenance?
- What is the difference between prospective and retrospective provenance?
- What are the trade-offs between workflow- and OS-based provenance capture?
- What questions can database provenance answer? What are the differences between “Why”, “How”, and “Where” provenance?
- What was evolution provenance in VisTrails used for?
- What concerns are involved in reproducing a previous computational study?
- What types of analyses can be done to evaluate how reproducible published work is?
- How might machine learning impact databases?
- Which type of engine (OLAP or OLTP) is SageDB being developed for? Which components of a database does SageDB present machine learning approaches for? How do they perform versus standard databases?