Test 2

Date, Time, & Location

Monday, April 10, 9:30-10:45am, PM 253

Overview

Test 2 will cover all material from the beginning of the semester through time series data (Monday, April 3), with a strong focus on material covered since Test 1. Questions will draw on the assigned readings and the topics we discussed in class.

Format

Multiple Choice
Free Response
CS680 students will have additional questions

Emphasized Topics

Data Cleaning
Data Transformation
Data Integration
Data Fusion
Scalable Databases
Scalable Dataframes
Time Series Data

Readings

Assigned Readings

pandas (and Python) content covered in class
Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks, C. Yan and Y. He, 2020
HoloClean: Holistic Data Repairs with Probabilistic Inference, T. Rekatsinas et al., 2017
Data Integration: The Current Status and the Way Forward, M. Stonebraker and I. Ilyas, 2018
Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009
NoSQL Database Systems: A Survey and Decision Guidance, F. Gessert et al., 2017
What’s Really New with NewSQL?, A. Pavlo and M. Aslett, 2016
Towards Scalable Dataframe Systems, D. Petersohn et al., 2020
Magpie: Python at Speed and Scale using Cloud Backends, A. Jindal et al., 2021
Gorilla: A Fast, Scalable, In-Memory Time Series Database, T. Pelkonen et al., 2015

Referenced Papers

Foofah: Transforming Data By Example, Z. Jin et al., 2017
Tidy Data, H. Wickham, 2014
Spreadsheet Table Transformations from Examples, W. R. Harris and S. Gulwani, 2011
SampleClean: Fast and Reliable Analytics on Dirty Data, S. Krishnan et al., 2015
Data Fusion, J. Bleiholder and F. Naumann, 2008
Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017
Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009
Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012
The Official Ten-Year Retrospective of NewSQL, A. Pavlo, 2021
ConnectorX: Accelerating Data Loading from Databases to Dataframes, X. Wang et al., 2022

Free Response Example Questions

What are the two techniques for data integration that we discussed? What are the advantages of each?
Why is record linkage important? What strategies can be used with it?
When merging two datasets, which parts of the data is data fusion concerned with? Why is source dependence important when doing data fusion of many sources?
Which type of database partitioning (horizontal or vertical) is sharding? Is this used more for OLTP or OLAP?
Which type of database system focuses on ACID?
What are the three parts of CAP in the CAP Theorem? Which type of system is Cassandra (CA, CP, or AP)?
How does Cassandra maintain replicas of its data?
Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of workloads (OLTP, OLAP) does it focus on? Why are accurate timestamps important?
How does Modin speed up pandas queries?
What does Magpie do to improve the scalability of pandas-style queries?
Why does Pavlo call NewSQL dead? What didn’t work?
What is an advantage of a dataframe over a database, and vice versa?
How much time series data does Gorilla focus on storing? Why? What techniques does it use to store this data in memory?