Date, Time, & Location

Monday, March 29, 3:30pm-4:45pm, Online (Blackboard)

Overview

Test 2 will cover all material from the beginning of the semester through spatial data (Wednesday, March 24), with a strong focus on material covered since Test 1. Questions will draw on the assigned readings and the topics we discussed in class.

Format

  • Multiple Choice
  • Free Response
  • CS680 students will have additional questions

Emphasized Topics

  • Data Transformation
  • Data Integration
  • Data Fusion
  • Data Exploration
  • Dataset Search
  • Data Citation
  • Scalable Databases
  • Graph Data
  • Databases and Visualization
  • Spatial Data

Readings

Assigned Readings

Referenced Papers

Free Response Example Questions

  • What is the difference between the concatenation and merge operations in pandas?
  • In which circumstances would you use a left merge/join instead of an inner merge/join?
  • What are the two techniques for data integration that we discussed? What are the advantages of each?
  • Why is record linkage important? What strategies can be used with it?
  • When merging two datasets, which parts of the data is data fusion concerned with? Why is source dependence important when doing data fusion of many sources?
  • If you were searching for a gas station with fuel after a hurricane, which data source would you prefer: Twitter (updated all the time) or AAA (updated once a day)?
  • What is churn with respect to dataset search? Which part of a dataset do data curators recommend keeping after the dataset is deleted?
  • Does Google Dataset Search analyze datasets in order to get metadata? If not, how is the metadata obtained?
  • Which type of database partitioning (horizontal or vertical) is sharding? Is this used more for OLTP or OLAP?
  • Which type of database system focuses on ACID?
  • What are the three parts of CAP in the CAP Theorem? Which type of system is Cassandra (CA, CP, or AP)?
  • How does Cassandra maintain replicas of its data?
  • Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of workload (OLTP, OLAP) does it focus on? Why are accurate timestamps important?
  • What are the four principles of FAIR data management? Which principle deals with authorization and authentication?
  • Why does DataCite create a DOI for a dataset? How is this used?
  • For which types of queries would we expect a graph database to provide better performance than a relational database?
  • What is unique about RDF triple stores compared to other graph databases with respect to schema and instance?
  • Why does imMens focus on being extremely efficient in calculating visualization updates?
  • What types of operations does Falcon prioritize for lower latency? Why?
  • What feature of many spatial datasets does Nanocubes take advantage of to reduce the size of the stored data?
  • In addition to doing pre-computation, how does ForeCache reduce interaction latency?
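
For the first two example questions (concatenation vs. merge, and left vs. inner merge), it may help to run a small sketch like the one below while reviewing. The two tables, their column names, and their values are made up for illustration and are not taken from the course materials.

```python
import pandas as pd

# Hypothetical toy tables; names and values are assumptions for illustration.
stations = pd.DataFrame({"station_id": [1, 2, 3], "name": ["A", "B", "C"]})
prices = pd.DataFrame({"station_id": [2, 3, 4], "price": [2.79, 2.85, 2.99]})

# Concatenation stacks the frames along an axis (rows here).
# It does not match rows on a key; columns missing from a frame become NaN.
stacked = pd.concat([stations, prices], ignore_index=True)

# Merge matches rows on a key column, similar to a SQL join.
# Inner: keep only keys that appear in both tables.
inner = stations.merge(prices, on="station_id", how="inner")
# Left: keep every row of the left table; unmatched rows get NaN on the right.
left = stations.merge(prices, on="station_id", how="left")

print(stacked.shape)
print(inner)
print(left)
```

Comparing the row counts of `inner` and `left` is one way to see which left-table rows had no match in the right table.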