Date, Time, & Location
Monday, April 10, 9:30-10:45am, PM 253
Overview
Test 2 will cover all material from the beginning of the semester
through time series data (Monday, April 3), with a strong focus on
material covered since Test 1. Questions will draw on the assigned
readings and the topics we discussed in class.
Emphasized Topics
- Data Cleaning
- Data Transformation
- Data Integration
- Data Fusion
- Scalable Databases
- Scalable Dataframes
- Time Series Data
Readings
Assigned Readings
- pandas (and Python) content covered in class
- Auto-Suggest:
Learning-to-Recommend Data Preparation Steps Using Data Science
Notebooks, C. Yan and Y. He, 2020
- HoloClean:
Holistic Data Repairs with Probabilistic Inference, T. Rekatsinas et
al., 2017
- Data
Integration: The Current Status and the Way Forward, M. Stonebraker
and I. Ilyas, 2018
- Integrating
Conflicting Data: The Role of Source Dependence, X. L. Dong et al.,
2009
- NoSQL
Database Systems: A Survey and Decision Guidance, F. Gessert et al.,
2017
- What’s
Really New with NewSQL?, A. Pavlo and M. Aslett, 2016
- Towards
Scalable Dataframe Systems, D. Petersohn et al., 2020
- Magpie:
Python at Speed and Scale using Cloud Backends, A. Jindal et al.,
2021
- Gorilla:
a fast, scalable, in-memory time series database, T. Pelkonen et
al., 2015
Referenced Papers
- Foofah:
Transforming Data By Example, Z. Jin et al., 2017
- Tidy Data,
H. Wickham, 2014
- Spreadsheet
Table Transformations from Examples, W. R. Harris and S. Gulwani,
2011
- SampleClean:
Fast and Reliable Analytics on Dirty Data, S. Krishnan et al.,
2015
- Data
Fusion, J. Bleiholder and F. Naumann, 2008
- Spanner,
TrueTime & The CAP Theorem, E. Brewer, 2017
- Cassandra - A Decentralized Structured Storage System,
A. Lakshman and P. Malik, 2009
- Spanner:
Google’s Globally-Distributed Database, J. C. Corbett et al.,
2012
- The
Official Ten-Year Retrospective of NewSQL, A. Pavlo, 2021
- ConnectorX:
accelerating data loading from databases to dataframes, X. Wang et
al., 2022
Free Response Example Questions
- What are the two techniques for data integration that we discussed?
What are the advantages of each?
- Why is record linkage important? What strategies can be used to
perform it?
- When merging two datasets, which parts of the data is data fusion
concerned with? Why is source dependence important when doing data
fusion of many sources?
- Which type of database partitioning (horizontal or vertical) is
sharding? Is this used more for OLTP or OLAP?
- Which type of database system focuses on ACID?
- What are the three parts of CAP in the CAP Theorem? Which type of
system is Cassandra (CA, CP, or AP)?
- How does Cassandra maintain replicas of its data?
- Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of
workloads (OLTP, OLAP) does it focus on? Why are accurate timestamps
important?
- How does Modin speed up pandas queries?
- What does Magpie do to improve the scalability of pandas-style
queries?
- Why does Pavlo call NewSQL dead? What didn’t work?
- What is an advantage of a dataframe over a database, and vice
versa?
- How much time series data does Gorilla focus on storing? Why? What
techniques does it use to store this data in memory?
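
As a study aid for the last question, the following is a minimal sketch of the two encoding ideas described in the Gorilla paper: delta-of-delta encoding for timestamps and XOR of consecutive floating-point values. It only computes the intermediate numbers and leaves out the variable-length bit packing that the actual system applies to them. The function names and sample data are hypothetical, chosen for illustration only.

```python
import struct


def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-deltas).

    Assumes at least two timestamps. Regularly spaced points yield
    delta-of-deltas of 0, which can be stored in very few bits.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def float_to_bits(x):
    """Reinterpret a 64-bit float as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_values(values):
    """Return (bits of first value, XOR of each value with its predecessor).

    Identical consecutive values XOR to 0; similar values tend to share
    many leading and trailing zero bits in the XOR result.
    """
    bits = [float_to_bits(v) for v in values]
    return bits[0], [a ^ b for a, b in zip(bits, bits[1:])]


# Hypothetical samples: roughly one point every 60 seconds.
timestamps = [1000, 1060, 1120, 1181, 1241]
first_ts, first_delta, dods = delta_of_delta(timestamps)

values = [12.0, 12.0, 12.5, 12.5]
first_bits, xors = xor_values(values)

print(first_ts, first_delta, dods)
print([format(x, "016x") for x in xors])
```

In this sample, most delta-of-deltas and XOR results are zero or close to it, which is the property that makes the compact encoding possible.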