The goal of this assignment is to learn how to use pandas and SQL (DuckDB) for data analysis tasks.
You may choose to work on this assignment on a hosted environment
(e.g. Google Colab) or
on your own local installation of Jupyter and Python. You should use
Python 3.10 or higher for your work. If you choose to work locally, Anaconda is the easiest
way to install and manage Python. If you work locally, you may launch
Jupyter Lab either from the Navigator application or via the
command-line as jupyter lab
. For this assignment, you will
need to make sure the pandas, python-duckdb, and polars packages are
installed. From conda, this can be accomplished with
conda install -c conda-forge pandas python-duckdb polars
.
In this assignment, we will again analyze the Metropolitan Museum of
Art Dataset which tracks information about artworks in its
collection. This dataset is provided by The Metropolitan Museum of Art (aka
The Met) and information about the columns is documented here. We will be working
on a subset of this data, available compressed here. Note: All
frameworks can read the .csv.gz
format directly (no need to
decompress). The attributes of the data are:
Object Number
: accession numberObject ID
: identifying numberDepartment
: curatorial department responsible for the
artworkAccessionYear
: year the artwork was acquiredObject Name
: describes the physical type of the
objectTitle
: title given to a work of artCulture
: information about the culture, or people from
which an object was createdPeriod
: time or time period when an object was
createdObject Date
: year or a span of years describing the
time when an artwork was designed or createdObject Begin Date
: machine readable date indicating the
year the artwork was started or createdObject End Date
: machine readable date indicating the
year the artwork was completedArtist Display Name
: artist name in the correct order
for displayYou will do the same analysis as in Assignment 1, but now using pandas, SQL, and polars. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.
The assignment is due at 11:59pm on Friday, February 16.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a2.ipynb
.
CSCI 640 students must complete all three parts; CSCI 490 students may omit the third part but will receive extra credit if they complete it. You may choose to complete the assignment question by question or framework by framework.
Read in the data file and store it in appropriate data structures. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.
read_csv
methodWrite code that computes the department that holds the most items in this subset.
Write code that computes the number of unique artist names contained
in the dataset. This is stored in the “Artist Display Name” attribute.
Note that some objects have multiple artists, delimited
by the |
character, and you need to consider each of them
as an individual name. For this part, any anonymous or unknown names
should be counted.
str.
prefix to
the method.Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.
Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s).
pd.to_numeric
to_numeric
has specific options to control
what happens when a string cannot be converted. Use the appropriate
option.DuckDB has a couple of APIs for interacting with databases: the Python DB-API 2.0
and a relational API (docs). Please use the
DB-API for this assignment. In addition, DuckDB supports transferring
data to and from pandas dataframes. Again, for this assignment, all
computation must be done via SQL. To connect to a new in-memory
database, use duckdb.connect()
. To execute a SQL statement,
use duckdb.execute
, and to retrieve any results, use
fetchall
or fetchone
after the execute call.
You should be able to write the solutions to all parts using one or two
SQL statements (parts a and b both require a statement to create a
table)).
Read in the data file to a new table. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.
Write code that computes the department that holds the most items in this subset.
GROUP BY
clauseORDER BY
clauseWrite code that computes the number of unique artist names contained
in the dataset. The names are stored in the “Artist Display Name”
attribute, but some objects have multiple artists,
delimited by the |
character, and you need to consider each
of them as an individual name. For this part, any anonymous or unknown
names should be counted. Your code here should
create a new table with just artist names to assist
with parts c and d.
CREATE TABLE ... AS (SELECT ... )
syntax to create a table from a select statementNow, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.
NOT
Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s). Note that you need to cast to receive full credit for this question, even though it may not be necessary.
TRY_CAST
variantPolars is a newer framework that replicates much of what pandas does but hews closer to SQL syntax. A key advantage is that it is designed for effective parallelism. You will complete the same tasks as Parts 1 and 2 but using polars.
Read in the data file. Then, write code that computes the date of the oldest objects. You can use the “Object Begin Date” and “Object End Date” attributes to make the necessary comparisons, but report the more human-readable “Object Date” attribute.
infer_schema_length
as is suggested.Write code that computes the department that holds the most items in this subset.
Write code that computes the number of unique artist names contained
in the dataset. The names are stored in the “Artist Display Name”
attribute, but some objects have multiple artists,
delimited by the |
character, and you need to consider each
of them as an individual name. For this part, any anonymous or unknown
names should be counted.
Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part c that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.
Write code that computes the object(s) that was acquired last based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the newest object(s).