Date, Time, & Location
Wednesday, April 6, 3:30pm-4:45pm, PM 153
Overview
Test 2 will cover all material from the beginning of the semester
through scalable databases (Wednesday, March 30), with a strong focus
on material covered since Test 1. Questions will draw on the assigned
readings and the topics we discussed in class.
Emphasized Topics
- Data Cleaning
- Data Transformation
- Data Integration
- Data Fusion
- Data Curation
- Data Citation
- Scalable Databases
Readings
Assigned Readings
- pandas (and Python) content covered in class
- Foofah: Transforming Data By Example, Z. Jin et al., 2017
- Integrating Conflicting Data: The Role of Source Dependence, X. L. Dong et al., 2009
- The FAIR Guiding Principles for scientific data management and stewardship, M. D. Wilkinson et al., 2016
- Cassandra - A Decentralized Structured Storage System, A. Lakshman and P. Malik, 2009
- Spanner: Google’s Globally-Distributed Database, J. C. Corbett et al., 2012
Referenced Papers
- Spreadsheet Table Transformations from Examples, W. R. Harris and S. Gulwani, 2011
- Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations, Y. He et al., 2018
- Tidy Data, H. Wickham, 2014
- Who Shares? Who Doesn’t? Factors Associated with Openly Archiving Raw Research Data, H. Piwowar, 2011
- Why data citation is a computational problem, P. Buneman et al., 2016
- Spanner, TrueTime & The CAP Theorem, E. Brewer, 2017
- F1: A Distributed SQL Database That Scales, J. Shute et al., 2013
Free Response Example Questions
- Compare Foofah’s example-based data cleaning with Wrangler’s
interactive data cleaning.
- In which circumstances would you use a left merge/join instead of an
inner merge/join?
- What are the two techniques for data integration that we discussed?
What are the advantages of each?
- Why is record linkage important? What strategies can be used with
it?
- When merging two datasets, which parts of the data is data fusion
concerned with? Why is source dependence important when doing data
fusion of many sources?
- What are the four types of principles in FAIR data management? Which
principle deals with authorization and authentication?
- Why does DataCite create a DOI for a dataset? How is this used?
- Which type of database partitioning (horizontal or vertical) is
sharding? Is this used more for OLTP or OLAP?
- Which type of database system focuses on ACID?
- What are the three parts of CAP in the CAP Theorem? Which type of
system is Cassandra (CA, CP, or AP)?
- How does Cassandra maintain replicas of its data?
- Which type of system is Spanner (SQL, NoSQL, NewSQL)? Which type of
workload (OLTP, OLAP) does it focus on? Why are accurate timestamps
important?
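
As a study aid for the merge/join question above, here is a minimal pandas sketch of how the two join types behave. It is an illustration rather than a model answer. The two DataFrames, their column names, and their values are hypothetical and do not come from the course material.

```python
import pandas as pd

# Hypothetical example data (names and values are made up for illustration).
students = pd.DataFrame(
    {"student_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]}
)
grades = pd.DataFrame(
    {"student_id": [1, 3, 4], "grade": ["A", "B", "C"]}
)

# Inner merge: keeps only the keys that appear in both DataFrames.
inner = students.merge(grades, on="student_id", how="inner")

# Left merge: keeps every row of the left DataFrame; rows without a
# match in the right DataFrame get NaN in the right-hand columns.
left = students.merge(grades, on="student_id", how="left")

print(inner)
print(left)
```

In this sketch the student with no recorded grade is dropped by the inner merge but kept, with a missing grade, by the left merge.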