Date, Time, & Location
Wednesday, October 8, 12:30-1:45pm, PM 103
Overview
Test 1 will cover all material from the beginning of the semester
through data cleaning, including the assigned readings and the topics
we discussed in class.
Topics
- Python
- Relational Algebra
- SQL
- Dataframes
- polars, DuckDB, pandas
- Data (items, attributes, attribute types, semantics, metadata)
- Data Wrangling
- Data Transformation
- Data Cleaning
Readings
Referenced Papers
- Potter’s Wheel: An Interactive Data Cleaning System, V. Raman and
J. M. Hellerstein, 2001
- Self-Service Data Preparation: Research to Practice, J. M.
Hellerstein et al., 2018
- Transform-Data-by-Example (TDE): An Extensible Search Engine for
Data Transformations, Y. He et al., 2018
- Foofah: Transforming Data By Example, Z. Jin et al., 2017
- AutoTables: Relationalize Tables without Examples, P. Li et al.,
2023
- SampleClean: Fast and Reliable Analytics on Dirty Data, S. Krishnan
et al., 2015
- Relational Data Cleaning Meets Artificial Intelligence: A Survey,
J. Zhu et al., 2024
Free Response Example Questions
- Given a dataset, (a) identify an item, attribute, and cell; (b)
state, for each column, whether it is categorical, ordered, or
quantitative.
- What are two differences between dataframes and relational
databases? Which would you use for logging web server activity, and why?
Which would you use for reading in raw data and performing data
cleaning, and why?
- Given a dataframe with information about people, write polars code
to find all people over the age of 30. Then write code to find the name
of the oldest person in the dataframe.
- Given two relations, one about instructors and the other with
department information, write a relational algebra expression to find
the department offices for all instructors with the title
“Professor”.
- State three distinct ways in which Wrangler helps users trying to
wrangle raw datasets.
- What is the importance of Potter’s Wheel for data wrangling? Why
does it show up in many papers?
- How is transform data by example (TDE) different from Wrangler? How
does transform by pattern (TBP) improve on transform by example, and
which use cases does it help with?
- Given an untidy data frame, identify the problems with it and the
transformations required to make it tidy. Exact syntax is not important;
if you do not recall a particular operation or function name, explain
specifically what the operation should do.
- What is the difference between Foofah and Auto-Suggest? Which tasks
does Auto-Suggest support?
- What are the three core tasks in data cleaning? Why is HoloClean
different from prior approaches to data cleaning?
- How does artificial intelligence help with data wrangling? What are
some limitations of current approaches?