The goal of this assignment is to learn how to use pandas and SQL (DuckDB) for data analysis tasks.
You may choose to work on this assignment on a hosted environment
(e.g. Google Colab) or
on your own local installation of Jupyter and Python. You should use
Python 3.8 or higher for your work. If you choose to work locally, Anaconda is the easiest
way to install and manage Python. If you work locally, you may launch
Jupyter Lab either from the Navigator application or via the
command-line as jupyter-lab
. For this assignment, you will
need to make sure the pandas and python-duckdb packages are installed.
From conda, this can be accomplished with
conda install -c conda-forge pandas python-duckdb
.
In this assignment, we will again analyze the Metropolitan Museum of Art Dataset which tracks information about artworks in its collection. This dataset is provided by The Metropolitan Museum of Art (aka The Met) and information about the columns is documented here. We will be working on a subset of this data, available here. The attributes of the data are:
Object Number
: accession numberObject ID
: identifying numberDepartment
: curatorial department responsible for the
artworkAccessionYear
: year the artwork was acquiredObject Name
: describes the physical type of the
objectTitle
: title given to a work of artCulture
: information about the culture, or people from
which an object was createdPeriod
: time or time period when an object was
createdObject Date
: year or a span of years describing the
time when an artwork was designed or createdObject Begin Date
: machine readable date indicating the
year the artwork was started or createdObject End Date
: machine readable date indicating the
year the artwork was completedArtist Display Name
: artist name in the correct order
for displayYou will do the same analysis as in Assignment 1, but now using pandas and SQL. Make sure to properly label each problem’s code and solution. Use a Markdown cell to add a header denoting your work for a particular problem. Make sure to document what your answer is to the question, and make sure your code actually computes that.
The assignment is due at 11:59pm on Friday, February 18.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a2.ipynb
.
Read in the data file. Then, write code that computes the date range of the oldest objects. You can use the “Object Begin Date” attribute to make the necessary comparisons, but report the more human-readable “Object Date” attribute.
read_csv
methodWrite code that computes the number of unique artist names contained
in the dataset. This is stored in the “Artist Display Name” attribute.
Note that some objects have multiple artists, delimited
by the |
character, and you need to consider each of them
as an individual name. For this part, any anonymous or unknown names
should be counted.
str.
prefix to
the method.Now, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part 1 that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.
Write code that computes the most frequent artist name. Your effort in Part b can be reused for this part.
Write code that computes the object(s) that was acquired first based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the oldest object(s).
pd.to_numeric
to_numeric
has specific options to control
what happens when a string cannot be converted. Use the appropriate
option.DuckDB has a couple of APIs for interacting with databases: the Python DB-API 2.0
and a relational API (docs). Please use the
DB-API for this assignment. In addition, DuckDB supports transferring
data to and from pandas dataframes. Again, for this assignment, all
computation must be done via SQL. To connect to a new in-memory
database, use duckdb.connect()
. To execute a SQL statement,
use duckdb.execute
, and to retrieve any results, use
fetchall
or fetchone
after the execute call.
You should be able to write the solutions to all parts using one or two
SQL statements (parts a and b both require a statement to create a
table)).
Read in the data file to a new table. Then, write code that computes the date range of the oldest objects. You can use the “Object Begin Date” attribute to make the necessary comparisons, but report the more human-readable “Object Date” attribute.
Write code that computes the number of unique artist names contained
in the dataset. The names are stored in the “Artist Display Name”
attribute, but some objects have multiple artists,
delimited by the |
character, and you need to consider each
of them as an individual name. For this part, any anonymous or unknown
names should be counted. Your code here should
create a new table with just artist names to assist
with parts b, c, and d.
CREATE TABLE ... AS (SELECT ... )
syntax to create a table from a select statementNow, write code that computes the number of unique, non-anonymous artist names. This means eliminating those names from Part 1 that start with “Anonymous” and “Unknown”. Note that there are artist names like “Anonymous, German, 17th century” that also are anonymous and should be removed from the number.
NOT
Write code that computes the most frequent artist name.
GROUP BY
clauseORDER BY
clauseWrite code that computes the object(s) that was acquired first based on the “AccessionYear” attribute. Note that some accession years are not years (i.e. not a four-digit number), and for this assignment, ignore those entries. Print the Title attribute for the oldest object(s). Note that you need to cast to receive full credit for this question, even though it may not be necessary.
TRY_CAST
variant