Final Exam

Date, Time, & Location

Monday, May 9, 4:00pm-5:50pm, PM 153

The final exam is comprehensive: it covers all material from the beginning of the semester through the end, with some emphasis on material covered since Test 2. Questions will draw on the assigned readings and the topics we discussed in class.

Format

Multiple Choice (20-25 questions)
Free Response (5-6 questions)
CS680 students will have additional questions

Topics

Python
numpy
pandas
Data (items, attributes, attribute types, semantics, metadata)
Data Wrangling
Data Cleaning
Data Transformation
Data Integration
Data Fusion
Data Exploration
Dataset Search
Scalable Databases
Data Curation
Graph Data
Databases and Visualization
Spatial Data
Time Series Data
Provenance (Computational, Database, Evolution)
Reproducibility
Databases and Machine Learning

Readings

Assigned Readings

Content aligned with the recommended text: Python for Data Analysis (2nd ed.), W. McKinney, Chs. 1-8, 11
Wrangler: Interactive Visual Specification of Data Transformation Scripts, S. Kandel et al., 2011
Foofah: Transforming Data By Example, Z. Jin et al., 2017
Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009
Dataset search: a survey, A. Chapman et al., 2020
Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009
Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012
The FAIR Guiding Principles for scientific data management and stewardship, M. D. Wilkinson et al., 2016
A Comparison of Current Graph Database Models, R. Angles, 2012
imMens: Real-time Visual Querying of Big Data, Z. Liu et al., 2013
Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, L. Lins et al., 2013
Provenance for Computational Tasks: A Survey, J. Freire et al., 2008
Reproducibility Using VisTrails, J. Freire et al., 2013
Repeatability in computer systems research, C. Collberg and T. A. Proebsting, 2016
SageDB: A Learned Database System, T. Kraska et al., 2019

Referenced Papers

Potter’s wheel: An interactive data cleaning system, V. Raman and J. M. Hellerstein, 2001
Self-Service Data Preparation: Research to Practice, J. M. Hellerstein et al., 2018
Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations, Y. He et al., 2018
Google Dataset Search: Building a search engine for datasets in an open Web ecosystem, N. Noy et al., 2019
Who Shares? Who Doesn’t? Factors Associated with Openly Archiving Raw Research Data, H. Piwowar, 2011
Why data citation is a computational problem, P. Buneman et al., 2016
Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017
Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations, D. Moritz et al., 2019
Dynamic prefetching of data tiles for interactive visualization, L. Battle et al., 2016
Provenance in Databases: Why, How, and Where, J. Cheney et al., 2007
A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks, J. F. Pimentel et al., 2019

Free Response Example Questions

Examples from Test 1
Examples from Test 2
For which types of queries would we expect a graph database to provide better performance than a relational database?
What is unique about RDF triple stores compared to other graph databases with respect to schema and instance?
Why does imMens focus on being extremely efficient in calculating visualization updates?
What types of operations does Falcon prioritize for lower latency? Why?
What feature of many spatial datasets does Nanocubes take advantage of to reduce the size of the stored data?
In addition to doing pre-computation, how does ForeCache reduce interaction latency?
How can you compare two time series datasets when their timestamps do not match up?
What is provenance, and what is required to capture, store, and use provenance?
What is the difference between prospective and retrospective provenance?
What are the trade-offs between workflow- and OS-based provenance capture?
What questions can database provenance answer? What are the differences between “Why”, “How”, and “Where” provenance?
What was evolution provenance in VisTrails used for?
What concerns arise when reproducing a previous computational study?
What types of analyses can be done to evaluate how reproducible published work is?
How might machine learning impact databases?
Which type of engine (OLAP or OLTP) is SageDB being developed for? Which components of a database does SageDB present machine learning approaches for? How do they perform versus standard databases?
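
Study sketch for the time series question above (comparing two series whose timestamps do not match up). This is one possible approach in pandas, not a model answer: the data, column names, 5-second tolerance, and 10-second grid are all made up for illustration. It shows two common options: joining each reading to the nearest timestamp in the other series, and resampling both series onto a shared regular grid.

```python
import pandas as pd

# Hypothetical readings from two sources sampled at different times.
a = pd.DataFrame({
    "time": pd.to_datetime(["2022-01-01 00:00:00",
                            "2022-01-01 00:00:10",
                            "2022-01-01 00:00:20"]),
    "a": [1.0, 2.0, 3.0],
})
b = pd.DataFrame({
    "time": pd.to_datetime(["2022-01-01 00:00:02",
                            "2022-01-01 00:00:11",
                            "2022-01-01 00:00:45"]),
    "b": [10.0, 20.0, 30.0],
})

# Option 1: for each row of `a`, take the nearest row of `b` in time,
# but only if it lies within the (assumed) 5-second tolerance.
# Both frames must be sorted by the join key.
aligned = pd.merge_asof(a, b, on="time",
                        direction="nearest",
                        tolerance=pd.Timedelta("5s"))

# Option 2: aggregate both series onto the same regular 10-second grid,
# then place them side by side. Empty bins become NaN.
grid_a = a.set_index("time")["a"].resample("10s").mean()
grid_b = b.set_index("time")["b"].resample("10s").mean()
resampled = pd.concat([grid_a, grid_b], axis=1)

print(aligned)
print(resampled)
```

In the first option, the last reading of `a` has no partner within the tolerance, so its `b` value is missing; how to treat such gaps (drop, interpolate, or widen the tolerance) depends on the data and the comparison being made.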