CSCI 640/490 – Assignment 2

Goals

The goal of this assignment is to learn how to use pandas and SQL (DuckDB) for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.10 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter lab. For this assignment, you will need to make sure the pandas, python-duckdb, and polars packages are installed. From conda, this can be accomplished with conda install -c conda-forge pandas python-duckdb polars.

In this assignment, we will again analyze the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. We will be working on a subset of this data, available compressed here. Note: All frameworks can read the .csv.gz format directly (no need to decompress). The attributes of the data are:

Object Number: accession number
Object ID: identifying number
Department: curatorial department responsible for the artwork
AccessionYear: year the artwork was acquired
Object Name: describes the physical type of the object
Title: title given to a work of art
Culture: information about the culture, or people from which an object was created
Period: time or time period when an object was created
Object Date: year or a span of years describing the time when an artwork was designed or created
Object Begin Date: machine readable date indicating the year the artwork was started or created
Object End Date: machine readable date indicating the year the artwork was completed
Artist Display Name: artist name in the correct order for display

You will do the same analysis as in Assignment 1, but now using pandas, SQL, and polars. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.

Due Date

The assignment is due at 11:59pm on Friday, February 16.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a2.ipynb.

Details

CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework.

1. Pandas

a. Date of the Oldest Object (10 pts)

Read in the data file and store it in appropriate data structures. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:

Pandas has a read_csv method
Remember that you can sort rows based on particular attributes
Do not use loops over all the data

b. Department with Most Items (10 pts)

Write code that computes the department that holds the most items in this subset.

Hints:

Pandas has a method to count value occurrences
Remember to sort any value counts

c. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. This is stored in the “Artist Display Name” attribute. Note that some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted.

Hints:

Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
Pandas also has methods to count the unique number of items
Consider persisting this series for parts c and d

d. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

Again, refer to the string methods.

e. Date of the Newest Accession(s) (10 pts)

Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s).

Hints:

You can convert a string to an integer using pd.to_numeric
Note that to_numeric has specific options to control what happens when a string cannot be converted. Use the appropriate option.

2. DuckDB

DuckDB has a couple of APIs for interacting with databases: the Python DB-API 2.0 and a relational API (docs). Please use the DB-API for this assignment. In addition, DuckDB supports transferring data to and from pandas dataframes. Again, for this assignment, all computation must be done via SQL. To connect to a new in-memory database, use duckdb.connect(). To execute a SQL statement, use duckdb.execute, and to retrieve any results, use fetchall or fetchone after the execute call. You should be able to write the solutions to all parts using one or two SQL statements (parts a and b both require a statement to create a table)).

a. Date of the Oldest Object (10 pts)

Read in the data file to a new table. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:

DuckDB has built-in support for CSV Loading
You can find the minimum value from a column as a sub-select that can then be used in a comparison

b. Department with Most Items (10 pts)

Write code that computes the department that holds the most items in this subset.

Hints:

To count, consider using the GROUP BY clause
To sort, consider using the ORDER BY clause

c. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. The names are stored in the “Artist Display Name” attribute, but some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted. Your code here should create a new table with just artist names to assist with parts c and d.

Hints:

DuckDB supports the CREATE TABLE ... AS (SELECT ... ) syntax to create a table from a select statement
DuckDB supports UNNEST and has a number of text functions
Rename the single column in the artist table
Recall the keyword required to count unique occurrences in SQL

d. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

Again, refer to the text functions.
Also, recall the keyword required to count unique occurrences in SQL
You can negate a statement using NOT

e. Date of the Newest Accession(s) (10 pts)

Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s). Note that you need to cast to receive full credit for this question, even though it may not be necessary.

Hints:

You can convert a string to an integer by casting it. DuckDB also supports the TRY_CAST variant
Remember that you can use a sub-select in a where clause to locate the row(s) that match the desired criteria.

3. [CSCI 640 Only] Polars

Polars is a newer framework that replicates much of what pandas does but hews closer to SQL syntax. A key advantage is that it is designed for effective parallelism. You will complete the same tasks as Parts 1 and 2 but using polars.

a. Date of the Oldest Object (10 pts)

Read in the data file. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:

You may see an error about the types. Consider increasing infer_schema_length as is suggested.

b. Department with Most Items (10 pts)

Write code that computes the department that holds the most items in this subset.

Hints:

polars also has a method to count instances of values

c. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. The names are stored in the “Artist Display Name” attribute, but some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted.

Hints:

Like pandas, polars has string methods, although the naming is slightly different

d. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

Like pandas, polars has string methods, although the naming is slightly different

e. Date of the Newest Accession(s) (10 pts)

Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s).

Hints:

polars uses the cast operator to change data types