Assignment 2

Goals

The goal of this assignment is to learn how to use pandas and SQL (DuckDB) for data analysis tasks.

Instructions

You may choose to work on this assignment on a hosted environment (e.g. Google Colab) or on your own local installation of Jupyter and Python. You should use Python 3.10 or higher for your work. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter lab. For this assignment, you will need to make sure the pandas, python-duckdb, and polars packages are installed. From conda, this can be accomplished with conda install -c conda-forge pandas python-duckdb polars.

In this assignment, we will again analyze the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. We will be working on a subset of this data, available compressed here. Note: All frameworks can read the .csv.gz format directly (no need to decompress). The attributes of the data are:

  1. Object Number: accession number
  2. Object ID: identifying number
  3. Department: curatorial department responsible for the artwork
  4. AccessionYear: year the artwork was acquired
  5. Object Name: describes the physical type of the object
  6. Title: title given to a work of art
  7. Culture: information about the culture, or people from which an object was created
  8. Period: time or time period when an object was created
  9. Object Date: year or a span of years describing the time when an artwork was designed or created
  10. Object Begin Date: machine readable date indicating the year the artwork was started or created
  11. Object End Date: machine readable date indicating the year the artwork was completed
  12. Artist Display Name: artist name in the correct order for display

You will do the same analysis as in Assignment 1, but now using pandas, SQL, and polars. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.

Due Date

The assignment is due at 11:59pm on Friday, February 16.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a2.ipynb.

Details

CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework.

1. Pandas

a. Date of the Oldest Object (10 pts)

Read in the data file and store it in appropriate data structures. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:
  • Pandas has a read_csv method
  • Remember that you can sort rows based on particular attributes
  • Do not use loops over all the data

b. Department with Most Items (10 pts)

Write code that computes the department that holds the most items in this subset.

Hints:
  • Pandas has a method to count value occurrences
  • Remember to sort any value counts

c. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. This is stored in the “Artist Display Name” attribute. Note that some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted.

Hints:
  • Pandas has a number of string methods which are helpful in processing a column of strings with a single command. Remember to add the str. prefix to the method.
  • Pandas also has methods to count the unique number of items
  • Consider persisting this series for parts c and d

d. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:

e. Date of the Newest Accession(s) (10 pts)

Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s).

Hints:
  • You can convert a string to an integer using pd.to_numeric
  • Note that to_numeric has specific options to control what happens when a string cannot be converted. Use the appropriate option.

2. DuckDB

DuckDB has a couple of APIs for interacting with databases: the Python DB-API 2.0 and a relational API (docs). Please use the DB-API for this assignment. In addition, DuckDB supports transferring data to and from pandas dataframes. Again, for this assignment, all computation must be done via SQL. To connect to a new in-memory database, use duckdb.connect(). To execute a SQL statement, use duckdb.execute, and to retrieve any results, use fetchall or fetchone after the execute call. You should be able to write the solutions to all parts using one or two SQL statements (parts a and b both require a statement to create a table)).

a. Date of the Oldest Object (10 pts)

Read in the data file to a new table. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:
  • DuckDB has built-in support for CSV Loading
  • You can find the minimum value from a column as a sub-select that can then be used in a comparison

b. Department with Most Items (10 pts)

Write code that computes the department that holds the most items in this subset.

Hints:
  • To count, consider using the GROUP BY clause
  • To sort, consider using the ORDER BY clause

c. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. The names are stored in the “Artist Display Name” attribute, but some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted. Your code here should create a new table with just artist names to assist with parts c and d.

Hints:
  • DuckDB supports the CREATE TABLE ... AS (SELECT ... ) syntax to create a table from a select statement
  • DuckDB supports UNNEST and has a number of text functions
  • Rename the single column in the artist table
  • Recall the keyword required to count unique occurrences in SQL

d. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:
  • Again, refer to the text functions.
  • Also, recall the keyword required to count unique occurrences in SQL
  • You can negate a statement using NOT

e. Date of the Newest Accession(s) (10 pts)

Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s). Note that you need to cast to receive full credit for this question, even though it may not be necessary.

Hints:
  • You can convert a string to an integer by casting it. DuckDB also supports the TRY_CAST variant
  • Remember that you can use a sub-select in a where clause to locate the row(s) that match the desired criteria.

3. [CSCI 640 Only] Polars

Polars is a newer framework that replicates much of what pandas does but hews closer to SQL syntax. A key advantage is that it is designed for effective parallelism. You will complete the same tasks as Parts 1 and 2 but using polars.

a. Date of the Oldest Object (10 pts)

Read in the data file. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.

Hints:
  • You may see an error about the types. Consider increasing infer_schema_length as is suggested.

b. Department with Most Items (10 pts)

Write code that computes the department that holds the most items in this subset.

Hints:
  • polars also has a method to count instances of values

c. Number of Unique Artist Names (10 pts)

Write code that computes the number of unique artist names contained in the dataset. The names are stored in the “Artist Display Name” attribute, but some objects have multiple artists, delimited by the | character, and you need to consider each of them as an individual name. For this part, any anonymous or unknown names should be counted.

Hints:
  • Like pandas, polars has string methods, although the naming is slightly different

d. Number of Unique Non-Anonymous Artist Names (10 pts)

Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.

Hints:
  • Like pandas, polars has string methods, although the naming is slightly different

e. Date of the Newest Accession(s) (10 pts)

Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s).

Hints:
  • polars uses the cast operator to change data types