CSCI 680/490 - Test 2

Date, Time, & Location

Monday, March 29, 3:30pm-4:45pm, Online (Blackboard)

Overview

Test 2 will cover all material from the beginning of the semester through spatial data (Wednesday, March 24), with a strong focus on material covered since Test 1. Questions will draw on the assigned readings and the topics we discussed in class.

Format

Multiple Choice
Free Response
CSCI 680 students will have additional questions

Emphasized Topics

Data Transformation
Data Integration
Data Fusion
Data Exploration
Dataset Search
Data Citation
Scalable Databases
Graph Data
Databases and Visualization
Spatial Data

Readings

Assigned Readings

Content aligned with recommended text (Python for Data Analysis (2nd ed.), W. McKinney, Chs. 1-8)
Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009.
Dataset search: a survey, A. Chapman et al., 2020.
Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009.
Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012.
The FAIR Guiding Principles for scientific data management and stewardship, M. D. Wilkinson et al., 2016.
A Comparison of Current Graph Database Models, R. Angles, 2012.
imMens: Real-time Visual Querying of Big Data, Z. Liu et al., 2013.
Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, L. Lins et al., 2013.

Referenced Papers

Potter’s wheel: An interactive data cleaning system, V. Raman and J. M. Hellerstein, 2001.
Self-Service Data Preparation: Research to Practice, J. M. Hellerstein et al., 2018.
Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations, Y. He et al., 2018.
Google Dataset Search: Building a search engine for datasets in an open Web ecosystem, N. Noy et al., 2019.
Who Shares? Who Doesn’t? Factors Associated with Openly Archiving Raw Research Data, H. Piwowar, 2011.
Why data citation is a computational problem, P. Buneman et al., 2016.
Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017.
Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations, D. Moritz et al., 2019.
Dynamic prefetching of data tiles for interactive visualization, L. Battle et al., 2016.

Free Response Example Questions

What is the difference between the concatenation and merge operations in pandas?
In which circumstances would you use a left merge/join instead of an inner merge/join?
What are the two techniques for data integration that we discussed? What are the advantages of each?
Why is record linkage important? What strategies can be used with it?
When merging two datasets, which parts of the data is data fusion concerned with? Why is source dependence important when doing data fusion of many sources?
If you were searching for a gas station with fuel after a hurricane, which data source would you prefer: Twitter (updated continuously) or AAA (updated once a day)?
What is churn with respect to dataset search? Which part of a dataset do data curators recommend keeping after the dataset is deleted?
Does Google Dataset Search analyze datasets in order to get metadata? If not, how is the metadata obtained?
Which type of database partitioning (horizontal or vertical) is sharding? Is this used more for OLTP or OLAP?
Which type of database system focuses on ACID?
What are the three parts of CAP in the CAP Theorem? Which type of system is Cassandra (CA, CP, or AP)?
How does Cassandra maintain replicas of its data?
Which type of system is Spanner (SQL, NoSQL, or NewSQL)? Which type of workload (OLTP or OLAP) does it focus on? Why are accurate timestamps important?
What are the four types of principles in FAIR data management? Which principle deals with authorization and authentication?
Why does DataCite create a DOI for a dataset? How is this used?
For which types of queries would we expect a graph database to provide better performance than a relational database?
What is unique about RDF triple stores compared to other graph databases with respect to schema and instance?
Why does imMens focus on being extremely efficient in calculating visualization updates?
What types of operations does Falcon prioritize for lower latency? Why?
What feature of many spatial datasets does Nanocubes take advantage of to reduce the size of the stored data?
In addition to doing pre-computation, how does ForeCache reduce interaction latency?
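
As a review aid for the first two example questions above (concatenation versus merge, and left versus inner merges), here is a minimal pandas sketch. The data frames, column names, and values are hypothetical and chosen only for illustration; it is a study sketch, not an answer key.

```python
import pandas as pd

# Hypothetical toy data (not from the course materials).
stations = pd.DataFrame({"station_id": [1, 2, 3],
                         "city": ["A", "B", "C"]})
prices = pd.DataFrame({"station_id": [2, 3, 4],
                       "price": [2.89, 2.95, 3.05]})

# concat stacks the frames along an axis (rows by default). It aligns on
# column labels but does not match rows by key values, so every input row
# appears once and missing columns are filled with NaN.
stacked = pd.concat([stations, prices], ignore_index=True)

# merge matches rows on key values, like a database join.
# An inner merge keeps only keys present in both frames.
inner = stations.merge(prices, on="station_id", how="inner")

# A left merge keeps every row of the left frame, filling NaN
# where the right frame has no matching key.
left = stations.merge(prices, on="station_id", how="left")

print(stacked)
print(inner)
print(left)
```

Comparing the row counts of `inner` and `left` shows which stations would be dropped when no matching price record exists.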