Test 1

Date, Time, & Location

Wednesday, October 8, 12:30-1:45pm, PM 103

Overview

Test 1 will cover all material from the beginning of the semester through data cleaning, including the assigned readings and the topics we discussed in class.

Format

  • Multiple Choice
  • Free Response
  • CSCI 640 students will have additional questions

Topics

  • Python
  • Relational Algebra
  • SQL
  • Dataframes
  • polars, DuckDB, pandas
  • Data (items, attributes, attribute types, semantics, metadata)
  • Data Wrangling
  • Data Transformation
  • Data Cleaning

Readings

Assigned Readings

Referenced Papers

Free Response Example Questions

  • Given a dataset, (a) identify an item, attribute, and cell; (b) state, for each column, whether it is categorical, ordered, or quantitative.
  • What are two differences between dataframes and relational databases? Which would you use for logging web server activity, and why? Which would you use for reading in raw data and performing data cleaning, and why?
  • Given a dataframe with information about people, write polars code to find all people over the age of 30. Find the name of the oldest person in the dataframe.
  • Given two relations, one about instructors and the other with department information, write a relational algebra expression to find the department offices for all instructors with the title “Professor”.
  • State three distinct ways in which Wrangler helps users trying to wrangle raw datasets.
  • What is the importance of Potter’s Wheel for data wrangling? Why does it show up in many papers?
  • How is transform data by example (TDE) different from Wrangler? How does transform by pattern (TBP) improve on transform by example, and what use cases does it aid with?
  • Given an untidy data frame, identify the problems with it and the transformations required to make it tidy. Exact syntax is not important; if you do not recall a particular operation or function name, explain specifically what the operation does.
  • What is the difference between Foofah and Auto-Suggest? Which tasks does Auto-Suggest support?
  • What are the three core tasks in data cleaning? Why is HoloClean different from prior approaches to data cleaning?
  • How does artificial intelligence help with data wrangling? What are some limitations of current approaches?