Assignment 2

Goals

The goal of this assignment is to learn how to use pandas and SQL (DuckDB) for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab. For this assignment, you will need to make sure the pandas and python-duckdb packages are installed. From conda, this can be accomplished with conda install -c conda-forge pandas python-duckdb.

In this assignment, we will again analyze the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. We will be working on a subset of this data, available here. The attributes of the data are:

  1. Object Number: accession number
  2. Object ID: identifying number
  3. Department: curatorial department responsible for the artwork
  4. AccessionYear: year the artwork was acquired
  5. Object Name: describes the physical type of the object
  6. Title: title given to a work of art
  7. Culture: information about the culture, or people from which an object was created
  8. Period: time or time period when an object was created
  9. Object Date: year or a span of years describing the time when an artwork was designed or created
  10. Object Begin Date: machine readable date indicating the year the artwork was started or created
  11. Object End Date: machine readable date indicating the year the artwork was completed
  12. Artist Display Name: artist name in the correct order for display

You will do the same analysis as in Assignment 1, but now using pandas and SQL. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.

Due Date

The assignment is due at 11:59pm on Friday, February 18.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a2.ipynb.

Details

1. Pandas

a. Date of the Oldest Object (10 pts)

Read in the data file. Then, write code that computes the date range of the oldest objects. You can use the “Object Begin Date” attribute to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:
  • Pandas has a read_csv method
  • Remember that you can sort rows based on particular attributes
  • Do not use loops over all the data

b. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. This is stored in the “Artist Display Name” attribute. Note that some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted.

Hints:
  • Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
  • Pandas also has methods to count the unique number of items
  • Consider persisting this series for parts c and d

c. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part 1 that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

d. Most Frequent Artist Name (10 pts)

Write code that computes the most frequent artist name. Your effort in Part b can be reused for this part.

Hints:
  • Pandas has a method to count value occurrences
  • Remember to sort any value counts

e. Date of the Oldest Accession (10 pts)

Write code that computes the object(s) that was acquired first based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the oldest object(s).

Hints:
  • You can convert a string to an integer using pd.to_numeric
  • Note that to_numeric has specific options to control what happens when a string cannot be converted. Use the appropriate option.

2. DuckDB

DuckDB has a couple of APIs for interacting with databases: the Python DB-API 2.0 and a relational API (docs). Please use the DB-API for this assignment. In addition, DuckDB supports transferring data to and from pandas dataframes. Again, for this assignment, all computation must be done via SQL. To connect to a new in-memory database, use duckdb.connect(). To execute a SQL statement, use duckdb.execute, and to retrieve any results, use fetchall or fetchone after the execute call. You should be able to write the solutions to all parts using one or two SQL statements (parts a and b both require a statement to create a table)).

a. Date of the Oldest Object (10 pts)

Read in the data file to a new table. Then, write code that computes the date range of the oldest objects. You can use the “Object Begin Date” attribute to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:
  • DuckDB has built-in support for CSV Loading
  • You can find the minimum value from a column as a sub-select that can then be used in a comparison

b. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. The names are stored in the “Artist Display Name” attribute, but some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted. Your code here should create a new table with just artist names to assist with parts b, c, and d.

Hints:
  • DuckDB supports the CREATE TABLE ... AS (SELECT ... ) syntax to create a table from a select statement
  • DuckDB supports UNNEST and has a number of text functions
  • Rename the single column in the artist table
  • Recall the keyword required to count unique occurrences in SQL

c. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part 1 that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:
  • Again, refer to the text functions.
  • Also, recall the keyword required to count unique occurrences in SQL
  • You can negate a statement using NOT

d. Most Frequent Artist Name (10 pts)

Write code that computes the most frequent artist name.

Hints:
  • To count, consider using the GROUP BY clause
  • To sort, consider using the ORDER BY clause

e. Date of the Oldest Accession (10 pts)

Write code that computes the object(s) that was acquired first based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the oldest object(s). Note that you need to cast to receive full credit for this question, even though it may not be necessary.

Hints:
  • You can convert a string to an integer by casting it. DuckDB also supports the TRY_CAST variant
  • Remember that you can use a sub-select in a where clause to locate the row(s) that match the desired criteria.