Test 2

Date, Time, & Location

Monday, April 10, 9:30-10:45am, PM 253

Overview

Test 2 will cover all material from the beginning of the semester through time series data (Monday, April 3), with a strong focus on material covered since Test 1. Questions will draw on the assigned readings and the topics we discussed in class.

Format

  • Multiple Choice
  • Free Response
  • CS680 students will have additional questions

Emphasized Topics

  • Data Cleaning
  • Data Transformation
  • Data Integration
  • Data Fusion
  • Scalable Databases
  • Scalable Dataframes
  • Time Series Data

Readings

Assigned Readings

Referenced Papers

Free Response Example Questions

  • What are the two techniques for data integration that we discussed? What are the advantages of each?
  • Why is record linkage important? What strategies can be used with it?
  • When merging two datasets, which parts of the data is data fusion concerned with? Why is source dependence important when doing data fusion of many sources?
  • Which type of database partitioning (horizontal or vertical) is sharding? Is this used more for OLTP or OLAP?
  • Which type of database system focuses on ACID?
  • What are the three parts of CAP in the CAP Theorem? Which type of system is Cassandra (CA, CP, or AP)?
  • How does Cassandra maintain replicas of its data?
  • Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of workload (OLTP, OLAP) does it focus on? Why are accurate timestamps important?
  • How does Modin speed up pandas queries?
  • What does Magpie do to improve the scalability of pandas-style queries?
  • Why does Pavlo call NewSQL dead? What didn’t work?
  • What is an advantage of a dataframe over a database, and vice versa?
  • How much time series data does Gorilla focus on storing? Why? What techniques does it use to store this data in memory?
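
As a study aid for the last question, here is a minimal sketch of two ideas commonly associated with Gorilla's in-memory compression: delta-of-delta encoding of timestamps and XOR of consecutive floating-point values. It is a simplification, not the paper's implementation: it only computes the intermediate values and omits the variable-length bit packing, and the function names are hypothetical.

```python
import struct


def delta_of_delta(timestamps):
    """Return [first timestamp, first delta, then each delta-of-delta]."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        # Regularly spaced points give a delta-of-delta of 0.
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def float_to_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_with_previous(values):
    """Return [bits of first value, then XOR of each value with its predecessor]."""
    if not values:
        return []
    bits = [float_to_bits(v) for v in values]
    # Identical consecutive values XOR to 0; similar values share many bits.
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]


if __name__ == "__main__":
    print(delta_of_delta([1000, 1060, 1120, 1181, 1241]))
    print(xor_with_previous([12.0, 12.0, 12.5]))
```

When points arrive at a nearly fixed interval and values change slowly, most of these intermediate results are zero or small, which is what makes a compact bit-level encoding possible.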