CSCI 490/680 - Test 2

Date, Time, & Location

Thursday, April 9, 3:30pm-4:45pm, Online

Overview

Test 2 will cover all material from the beginning of the semester through time series data (Tuesday, April 7), with a strong focus on material covered since Test 1. Questions will draw on the assigned readings and the topics we discussed in class.

Format

Multiple Choice (10-15 questions)
Free Response (5-6 questions)
CSCI 680 students will have additional questions

Topics

Data Integration
Data Fusion
Data Exploration
Dataset Search
Scalable Databases
Data Curation
Graph Data
Time Series Data

Readings

Assigned Readings

Python for Data Analysis (2nd ed.), W. McKinney, Chs. 1-8, 11.
Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009.
Dataset search: a survey, A. Chapman et al., 2020.
Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009.
Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012.
The FAIR Guiding Principles for scientific data management and stewardship, M. D. Wilkinson et al., 2016.
A Comparison of Current Graph Database Models, R. Angles, 2012.

Referenced Papers

Goods: Organizing Google’s Datasets, A. Halevy et al., 2016.
Google Dataset Search: Building a search engine for datasets in an open Web ecosystem, N. Noy et al., 2019.
F1: A Distributed SQL Database That Scales, J. Shute et al., 2013.
Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017.
An Introduction to Graph Data Management, R. Angles and C. Gutierrez, 2017.

Free Response Example Questions

What is the difference between the concatenation and merge operations in pandas?
In which circumstances would you use a left merge/join instead of an inner merge/join?
What are the two techniques for data integration that we discussed? What are the advantages of each?
Why is record linkage important? What strategies can be used with it?
When merging two datasets, which parts of the data is data fusion concerned with? Why is source dependence important when doing data fusion of many sources?
If you were searching for a gas station with fuel after a hurricane, which data source would you prefer: Twitter (updated continuously) or AAA (updated once per day)?
What is churn with respect to dataset search? Which part of a dataset do data curators recommend keeping after the dataset is deleted?
Does Google Dataset Search analyze datasets in order to get metadata? If not, how is the metadata obtained?
Which type of database partitioning (horizontal or vertical) is sharding? Is this used more for OLTP or OLAP?
Which type of database system focuses on ACID?
What are the three parts of CAP in the CAP Theorem? Which type of system is Cassandra (CA, CP, or AP)?
How does Cassandra maintain replicas of its data?
Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of workload (OLTP, OLAP) does it focus on? Why are accurate timestamps important?
What are the four types of principles in FAIR data management? Which principle deals with authorization and authentication?
Why does DataCite create a DOI for a dataset? How is this used?
For which types of queries would we expect a graph database to provide better performance than a relational database?
What is unique about RDF triple stores compared to other graph databases with respect to schema and instance?
How can you compare two time series datasets when their timestamps do not match up?
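
As a study aid for the pandas questions in the list above (concatenation vs. merge, left vs. inner merge, and time series with mismatched timestamps), here is a minimal sketch. All data, column names, and variable names are hypothetical and invented for illustration. `pd.merge_asof` is shown as one option for aligning timestamps, not necessarily the approach covered in class.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
stations = pd.DataFrame({"station_id": [1, 2, 3], "name": ["A", "B", "C"]})
prices = pd.DataFrame({"station_id": [2, 3, 4], "price": [2.10, 2.25, 2.40]})

# Concatenation stacks the frames along an axis; rows are NOT matched on a key.
# Columns missing from one frame are filled with NaN.
stacked = pd.concat([stations, prices], ignore_index=True)

# Merge matches rows on a key column, like a database join.
# Inner: keep only keys present in both frames (station_id 2 and 3).
inner = stations.merge(prices, on="station_id", how="inner")
# Left: keep every row of the left frame; unmatched rows get NaN (station_id 1).
left = stations.merge(prices, on="station_id", how="left")

# Two time series whose timestamps do not line up (both sorted by time).
readings = pd.DataFrame({
    "time": pd.to_datetime(["2020-04-07 09:00:03", "2020-04-07 09:00:10"]),
    "temp": [20.5, 21.0],
})
events = pd.DataFrame({
    "time": pd.to_datetime(["2020-04-07 09:00:05", "2020-04-07 09:00:12"]),
    "event": ["open", "close"],
})

# merge_asof pairs each event with the most recent reading at or before it.
aligned = pd.merge_asof(events, readings, on="time", direction="backward")

print(stacked)
print(inner)
print(left)
print(aligned)
```

Comparing the row counts of `inner` and `left` shows when a left merge is useful: it keeps records from the left table even when the right table has no match for them.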