Date, Time, & Location
Monday, March 29, 3:30pm-4:45pm, Online (Blackboard)
Overview
Test 2 will cover all material from the beginning of the semester through spatial data (Wednesday, March 24), with a strong focus on material covered since Test 1. The test draws on the assigned readings and the topics we discussed in class.
Emphasized Topics
- Data Transformation
- Data Integration
- Data Fusion
- Data Exploration
- Dataset Search
- Data Citation
- Scalable Databases
- Graph Data
- Databases and Visualization
- Spatial Data
Readings
Assigned Readings
- Content aligned with recommended text (Python for Data Analysis (2nd ed.), W. McKinney, Chs. 1-8)
- Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009.
- Dataset search: a survey, A. Chapman et al., 2020.
- Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009.
- Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012.
- The FAIR Guiding Principles for scientific data management and stewardship, M. D. Wilkinson et al., 2016.
- A Comparison of Current Graph Database Models, R. Angles, 2012.
- imMens: Real-time Visual Querying of Big Data, Z. Liu et al., 2013.
- Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, L. Lins et al., 2013.
Referenced Papers
- Potter’s wheel: An interactive data cleaning system, V. Raman and J. M. Hellerstein, 2001.
- Self-Service Data Preparation: Research to Practice, J. M. Hellerstein et al., 2018.
- Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations, Y. He et al., 2018.
- Google Dataset Search: Building a search engine for datasets in an open Web ecosystem, N. Noy et al., 2019.
- Who Shares? Who Doesn’t? Factors Associated with Openly Archiving Raw Research Data, H. Piwowar, 2011.
- Why data citation is a computational problem, P. Buneman et al., 2016.
- Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017.
- Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations, D. Moritz et al., 2019.
- Dynamic prefetching of data tiles for interactive visualization, L. Battle et al., 2016.
Free Response Example Questions
- What is the difference between the concatenation and merge operations in pandas?
- In which circumstances would you use a left merge/join instead of an inner merge/join?
- What are the two techniques for data integration that we discussed? What are the advantages of each?
- Why is record linkage important? What strategies can be used to perform it?
- When merging two datasets, which parts of the data is data fusion concerned with? Why is source dependence important when doing data fusion of many sources?
- If you were searching for a gas station with fuel after a hurricane, which data source would you prefer: Twitter (updated continuously) or AAA (updated once a day)?
- What is churn with respect to dataset search? Which part of a dataset do data curators recommend keeping after the dataset is deleted?
- Does Google Dataset Search analyze datasets in order to get metadata? If not, how is the metadata obtained?
- Which type of database partitioning (horizontal or vertical) is sharding? Is this used more for OLTP or OLAP?
- Which type of database system focuses on ACID?
- What are the three parts of CAP in the CAP Theorem? Which type of system is Cassandra (CA, CP, or AP)?
- How does Cassandra maintain replicas of its data?
- Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of workloads (OLTP, OLAP) does it focus on? Why are accurate timestamps important?
- What are the four types of principles in FAIR data management? Which principle deals with authorization and authentication?
- Why does DataCite create a DOI for a dataset? How is this used?
- For which types of queries would we expect a graph database to provide better performance than a relational database?
- What is unique about RDF triple stores compared to other graph databases with respect to schema and instance?
- Why does imMens focus on being extremely efficient in calculating visualization updates?
- What types of operations does Falcon prioritize for lower latency? Why?
- What feature of many spatial datasets does Nanocubes take advantage of to reduce the size of the stored data?
- In addition to doing pre-computation, how does ForeCache reduce interaction latency?
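
For practice with the pandas questions at the top of this list (concatenation vs. merge, and left vs. inner joins), the following is a minimal sketch you can run and modify. The two small DataFrames and their column names (`key`, `name`, `score`) are made up for illustration and do not come from the course materials.

```python
import pandas as pd

# Two hypothetical tables that share a "key" column (illustrative data only).
left = pd.DataFrame({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "score": [20, 30, 40]})

# Concatenation stacks the rows of both tables; columns missing from one
# table are filled with NaN. No rows are matched against each other.
stacked = pd.concat([left, right], ignore_index=True)

# An inner merge keeps only the keys that appear in both tables.
inner = left.merge(right, on="key", how="inner")

# A left merge keeps every row of the left table; rows without a match
# in the right table get NaN in the right table's columns.
left_join = left.merge(right, on="key", how="left")

print(stacked)
print(inner)
print(left_join)
```

Try changing the keys or swapping `how="left"` for `how="outer"` and compare the row counts of each result.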