CSCI 680/490 – Assignment 2

Goals

The goal of this assignment is to learn how to use pandas and SQL (DuckDB) for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab. For this assignment, you will need to make sure the pandas and python-duckdb packages are installed. From conda, this can be accomplished with conda install -c conda-forge pandas python-duckdb.

In this assignment, we will again analyze the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. We will be working on a subset of this data, available here. The attributes of the data are:

Object Number: accession number
Object ID: identifying number
Department: curatorial department responsible for the artwork
AccessionYear: year the artwork was acquired
Object Name: describes the physical type of the object
Title: title given to a work of art
Culture: information about the culture, or people from which an object was created
Period: time or time period when an object was created
Object Date: year or a span of years describing the time when an artwork was designed or created
Object Begin Date: machine readable date indicating the year the artwork was started or created
Object End Date: machine readable date indicating the year the artwork was completed
Artist Display Name: artist name in the correct order for display

You will do the same analysis as in Assignment 1, but now using pandas and SQL. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.

Due Date

The assignment is due at 11:59pm on Friday, February 18.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a2.ipynb.

Details

1. Pandas

a. Date of the Oldest Object (10 pts)

Read in the data file. Then, write code that computes the date range of the oldest objects. You can use the “Object Begin Date” attribute to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:

Pandas has a read_csv method
Remember that you can sort rows based on particular attributes
Do not use loops over all the data

b. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. This is stored in the “Artist Display Name” attribute. Note that some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted.

Hints:

Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
Pandas also has methods to count the unique number of items
Consider persisting this series for parts c and d

c. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part 1 that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

Again, refer to the string methods.

d. Most Frequent Artist Name (10 pts)

Write code that computes the most frequent artist name. Your effort in Part b can be reused for this part.

Hints:

Pandas has a method to count value occurrences
Remember to sort any value counts

e. Date of the Oldest Accession (10 pts)

Write code that computes the object(s) that was acquired first based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the oldest object(s).

Hints:

You can convert a string to an integer using pd.to_numeric
Note that to_numeric has specific options to control what happens when a string cannot be converted. Use the appropriate option.

2. DuckDB

DuckDB has a couple of APIs for interacting with databases: the Python DB-API 2.0 and a relational API (docs). Please use the DB-API for this assignment. In addition, DuckDB supports transferring data to and from pandas dataframes. Again, for this assignment, all computation must be done via SQL. To connect to a new in-memory database, use duckdb.connect(). To execute a SQL statement, use duckdb.execute, and to retrieve any results, use fetchall or fetchone after the execute call. You should be able to write the solutions to all parts using one or two SQL statements (parts a and b both require a statement to create a table)).

a. Date of the Oldest Object (10 pts)

Read in the data file to a new table. Then, write code that computes the date range of the oldest objects. You can use the “Object Begin Date” attribute to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:

DuckDB has built-in support for CSV Loading
You can find the minimum value from a column as a sub-select that can then be used in a comparison

b. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. The names are stored in the “Artist Display Name” attribute, but some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted. Your code here should create a new table with just artist names to assist with parts b, c, and d.

Hints:

DuckDB supports the CREATE TABLE ... AS (SELECT ... ) syntax to create a table from a select statement
DuckDB supports UNNEST and has a number of text functions
Rename the single column in the artist table
Recall the keyword required to count unique occurrences in SQL

c. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part 1 that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

Again, refer to the text functions.
Also, recall the keyword required to count unique occurrences in SQL
You can negate a statement using NOT

d. Most Frequent Artist Name (10 pts)

Write code that computes the most frequent artist name.

Hints:

To count, consider using the GROUP BY clause
To sort, consider using the ORDER BY clause

e. Date of the Oldest Accession (10 pts)

Write code that computes the object(s) that was acquired first based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the oldest object(s). Note that you need to cast to receive full credit for this question, even though it may not be necessary.

Hints:

You can convert a string to an integer by casting it. DuckDB also supports the TRY_CAST variant
Remember that you can use a sub-select in a where clause to locate the row(s) that match the desired criteria.