The goal of this assignment is to work with scripts and packages in Python.
You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.12 for your work. (Older versions may work, but your code will be checked with Python 3.12.) To use tiger, use the credentials you received. If you work remotely, make sure to download the .py files to turn in. If you choose to work locally, Anaconda or miniforge are probably the easiest ways to install and manage Python. You will probably find it useful to create a notebook to test your python package, but you are not required to turn that in.
In this assignment, we will again be working with data from the United States Department of Transportation’s Border Crossing Entry Data that we first used for Assignment 3. Recall that this dataset counts all traffic coming through the ports at the Canadian and Mexican borders. Rather than using this dataset directly, I have created a subset of this data, which can be read as a list of dictionaries. That data is located here, but you should download it via code (see Part 1a). Once loaded, the data is a list of dictionaries where each dictionary has seven key-value pairs. Those keys and a brief description are:
Port Name
: a name for the port, often associated with its location
State
: the state the port is located in
Port Code
: a unique numeric identifier for the port
Border
: the country whose border the port is for (Canada or Mexico)
Date
: the month and year when the data was collected, as a string
Measure
: the type of conveyance/container/person being counted
Value
: the value of the measure for the specified month
You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.
The assignment is due at 11:59pm on Friday, March 21.
You should submit the completed Python files required for this assignment on Blackboard. Zip the files together; the filename of the zipfile should be a5.zip. You can create an archive on tiger (assuming you created an a5 directory above the package that is your current working directory) using the following code in a notebook:
import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')
Then, download the a5.zip file to turn in via Blackboard. Make sure your archive contains all of the port_entries package.
Please make sure to follow instructions to receive full credit. To test your code, you may use the %run magic command in the notebook. For example,
%run -m port_entries find -n Eastport
You may also use the Terminal in Jupyter on tiger, but you should make sure to activate the correct environment using conda:
$ conda activate py3.12
$ python -m port_entries find -n Eastport
Since we are using Python files (.py) for this assignment, add the identifying information to the __init__.py file of your package. Minimally, you should have a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
Create three new Python modules, one for reading the dataset, one for finding ports, and one for comparing two ports. Put the three modules (util.py, find.py, and compare.py) into a package named port_entries.
Create a util.py module that has three functions: download_data, get_data, and parse_data.
The download_data function should download the border-crossing.json datafile and store it locally in a file named border-crossing.json. The parse_data function should parse the JSON data (list of dictionaries) into a new data structure that organizes data by port. It should create a dictionary that looks like:
{<port_code>: {
    name: <port_name>,
    border: <border>,
    state: <state>,
    monthly_data: { <date>: {<measure>: <value>, ...}, ...}
  },
  ...
}
In other words, port codes are keys in a dictionary whose values hold data about the port along with a monthly data dictionary whose keys are dates and whose values are dictionaries that contain Measure-Value key-value pairs (similar to Part 4 of Assignment 3). The get_data function should, if the data has already been loaded and parsed, return the already stored data (from the module-level variable); if not, it should load the data from the JSON file, parse it using parse_data, store it in a module-level variable, and return that value. Assume that the data file resides in the same directory as util.py. You can then get its absolute path via the __file__ variable of the module:
import os
fname = os.path.join(os.path.dirname(__file__), 'border-crossing.json')
Use the json module to load the data from the file. The download_data function should download the file only once: the first call downloads it and returns the local filename, and later calls simply return the local filename. Refer to the Assignment 3 starter notebook for code that can be used to download data from https://gist.githubusercontent.com/dakoop/2d80cefa399f1926a9a6d16f7d5d8757/raw/d137113184043b465c27f259e5b86f931fb133b2/border-crossing.json.
The get_data function should load and parse the file from disk once, otherwise returning the pre-loaded data. The parse_data function should parse the port entry data as specified above.
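The restructuring and load-once behavior described above can be sketched as follows. This is a minimal sketch, not the reference solution: it assumes the inner dictionaries use plain string keys ("name", "border", "state", "monthly_data"), it takes the file name as a hypothetical fname parameter instead of computing it from __file__, and the two sample records are made up for illustration.

```python
import json

_data = None  # module-level cache; stays None until the first get_data() call


def parse_data(records):
    """Reorganize a flat list of record dictionaries by port code."""
    ports = {}
    for rec in records:
        # Create the port entry the first time its code appears.
        port = ports.setdefault(rec["Port Code"], {
            "name": rec["Port Name"],
            "border": rec["Border"],
            "state": rec["State"],
            "monthly_data": {},
        })
        # Create the month entry on demand, then record the measure.
        month = port["monthly_data"].setdefault(rec["Date"], {})
        month[rec["Measure"]] = rec["Value"]
    return ports


def get_data(fname):
    """Load and parse fname on the first call; reuse the cached result afterwards."""
    global _data
    if _data is None:
        with open(fname) as f:
            _data = parse_data(json.load(f))
    return _data


# Made-up records in the shape described above (not real counts).
sample = [
    {"Port Name": "Eastport", "State": "Idaho", "Port Code": 3302,
     "Border": "Canada", "Date": "Jul 2023", "Measure": "Trucks", "Value": 10},
    {"Port Name": "Eastport", "State": "Idaho", "Port Code": 3302,
     "Border": "Canada", "Date": "Jul 2023", "Measure": "Buses", "Value": 2},
]
parsed = parse_data(sample)
print(parsed[3302]["monthly_data"])  # {'Jul 2023': {'Trucks': 10, 'Buses': 2}}
```

Because get_data checks the module-level variable before touching the disk, calling it repeatedly returns the same dictionary object rather than re-reading the file.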
Hint: You can use %autoreload to automatically reload modules as you edit them. Note, however, that this will mask the effects of trying to not keep reloading the data! You can also use importlib.reload to reload modules manually.
Create a find.py module that has two functions, ports_by_state and ports_by_name. Both functions should return the ports that match the given name and/or state. ports_by_state should take one parameter, the state name, while ports_by_name should take two parameters, the port name and the state. However, the state is optional, so if the user does not provide the state, the function should search across all states. Each function should return a list of tuples of the form (<port_code>, <port_name>, <state>). For example,
>>> port_entries.find.ports_by_state('Idaho')
[(3308, 'Porthill', 'Idaho'), (3302, 'Eastport', 'Idaho')]
>>> port_entries.find.ports_by_name('Eastport')
[(3302, 'Eastport', 'Idaho'), (103, 'Eastport', 'Maine')]
>>> port_entries.find.ports_by_name('Eastport', state='Idaho')
[(3302, 'Eastport', 'Idaho')]
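A minimal sketch of the two lookups, assuming the parsed structure from util.py uses string keys such as "name" and "state". To stay self-contained, the sketch takes the parsed dictionary as a hypothetical first parameter (ports); the functions in the assignment take only the name and/or state and would obtain the data from get_data. The three ports are the ones from the examples above, with their monthly data left empty.

```python
def ports_by_state(ports, state):
    """Return (code, name, state) tuples for every port in the given state."""
    return [(code, info["name"], info["state"])
            for code, info in ports.items()
            if info["state"] == state]


def ports_by_name(ports, name, state=None):
    """Return matching ports; state=None means search across all states."""
    return [(code, info["name"], info["state"])
            for code, info in ports.items()
            if info["name"] == name and (state is None or info["state"] == state)]


# Hypothetical stand-in for the parsed data (monthly data omitted).
ports = {
    3308: {"name": "Porthill", "border": "Canada", "state": "Idaho", "monthly_data": {}},
    3302: {"name": "Eastport", "border": "Canada", "state": "Idaho", "monthly_data": {}},
    103: {"name": "Eastport", "border": "Canada", "state": "Maine", "monthly_data": {}},
}
print(ports_by_name(ports, "Eastport"))
# [(3302, 'Eastport', 'Idaho'), (103, 'Eastport', 'Maine')]
print(ports_by_name(ports, "Eastport", state="Idaho"))
# [(3302, 'Eastport', 'Idaho')]
```

Using None as the default for state keeps a single code path: the state test is skipped only when the caller does not supply one.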
ports_by_name
Create a compare.py module that calculates comparative information between two dates for a single port, or between two ports for a given date. Given a port code and two date strings as parameters, the diff_dates function should return the difference in values for each measure between the two months. Given two port codes and a date string as parameters, the diff_ports function should return the difference between ports in values for each measure. When one date/port does not have a value for the given measure, assume its value is zero (0).
Examples:
>>> port_entries.compare.diff_dates(3302, "Jul 2023", "Aug 2023")
{'Truck Containers Empty': -94,
'Truck Containers Loaded': 433,
'Rail Containers Loaded': 870,
'Trucks': 332,
'Rail Containers Empty': 118,
'Bus Passengers': 356,
'Personal Vehicles': 1077,
'Buses': 10,
'Train Passengers': 4,
'Personal Vehicle Passengers': 2257,
'Trains': 7}
>>> port_entries.compare.diff_ports(2604, 2608, "Jul 2023")
{'Truck Containers Empty': 5923,
'Truck Containers Loaded': 24983,
'Rail Containers Loaded': 5102,
'Pedestrians': 112880,
'Trucks': 19547,
'Rail Containers Empty': 1808,
'Personal Vehicles': 55736,
'Bus Passengers': 19990,
'Buses': 771,
'Personal Vehicle Passengers': 195843,
'Trains': 73}
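The zero-default rule can be handled with dict.get. Below is a minimal sketch, assuming each argument is a {measure: value} dictionary for one port and month. The helper name diff_measures and the first-minus-second sign convention are assumptions, not part of the assignment, and the counts are made up.

```python
def diff_measures(first, second):
    """Return first minus second for every measure present in either dictionary."""
    measures = set(first) | set(second)
    # dict.get supplies 0 when a measure is missing on one side.
    return {m: first.get(m, 0) - second.get(m, 0) for m in measures}


# Made-up counts for illustration only.
july = {"Trucks": 5, "Buses": 2}
august = {"Trucks": 3, "Trains": 1}
result = diff_measures(july, august)
print(sorted(result.items()))  # [('Buses', 2), ('Trains', -1), ('Trucks', 2)]
```

Taking the union of the two key sets ensures a measure that appears on only one side still shows up in the result.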
Hint: Dictionaries have a get method that allows you to provide a default value if a key does not exist.
Make sure all three analysis modules live in a single port_entries package. Add an __init__.py file for completeness. It may contain documentation (including your name and Z-ID) and the pass keyword.
Now, we will create one command-line program for our Python package that uses the find.py and compare.py modules we just created. The command-line program will use the functions but produce more readable output. We will run this program via Python’s -m functionality and subcommands.
Create a __main__.py file to store the command-line program. It will need to switch between the find and compare subcommands, so you can start by writing code that examines the first argument and ensures that it is either “find” or “compare” and prints a usage statement if not. Check that your code is working by running the package:
>>> %run -m port_entries find
>>> %run -m port_entries calc
Usage: python -m port_entries <find | compare>
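The subcommand check can be sketched as a small dispatch function. This is a minimal sketch under assumptions: the function name main, the integer return status, and the placeholder message are illustrative choices, and the real handlers for find and compare are not shown.

```python
USAGE = "Usage: python -m port_entries <find | compare>"


def main(argv):
    """Dispatch on the first command-line argument; return an exit status."""
    # argv[0] is the program name, so the subcommand is argv[1].
    if len(argv) < 2 or argv[1] not in ("find", "compare"):
        print(USAGE)
        return 1
    subcommand, rest = argv[1], argv[2:]
    # Placeholder: a real program would call the find or compare handler here.
    print(f"would run {subcommand!r} with arguments {rest}")
    return 0


# In __main__.py this would be called as main(sys.argv) after `import sys`.
main(["port_entries", "calc"])                    # prints the usage statement
main(["port_entries", "find", "-n", "Eastport"])  # reaches the placeholder
```

Passing the argument list in as a parameter, rather than reading sys.argv inside the function, makes the dispatch easy to exercise from a notebook.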
First, we will create the subcommand find that accepts two flags, -n for ports_by_name searches, and -s for ports_by_state searches. Each flag is followed by a single string argument. That string should be passed to the respective function, but the output should be displayed in a more succinct manner. Each data item should be listed as <port_code>: <port_name>, <state>. Make sure to have a usage statement that is shown when a user enters incorrect input. You should test your script via the IPython magic command %run -m port_entries find ... Some sample output:
>>> %run -m port_entries find
Usage: python -m port_entries find [-n <port-name>] [-s <state>]
>>> %run -m port_entries find -s Idaho
3302: Eastport, Idaho
3308: Porthill, Idaho
>>> %run -m port_entries find -n "Eastport"
3302: Eastport, Idaho
103: Eastport, Maine
>>> %run -m port_entries find -n "Eastport" -s Idaho
3302: Eastport, Idaho
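One way to read the -n and -s flags is to walk the arguments as flag/value pairs. The sketch below is only one possible approach; the helper names parse_find_args and format_port are hypothetical, and it assumes every flag is followed by exactly one value.

```python
FIND_USAGE = "Usage: python -m port_entries find [-n <port-name>] [-s <state>]"


def parse_find_args(args):
    """Return (name, state) from flag/value pairs, or None if the input is invalid."""
    values = {"-n": None, "-s": None}
    # Flags and values must alternate, and at least one pair is required.
    if not args or len(args) % 2 != 0:
        return None
    for flag, value in zip(args[::2], args[1::2]):
        if flag not in values:
            return None
        values[flag] = value
    return values["-n"], values["-s"]


def format_port(port):
    """Format a (code, name, state) tuple as '<port_code>: <port_name>, <state>'."""
    code, name, state = port
    return f"{code}: {name}, {state}"


if parse_find_args([]) is None:
    print(FIND_USAGE)
print(parse_find_args(["-n", "Eastport", "-s", "Idaho"]))  # ('Eastport', 'Idaho')
print(format_port((3302, "Eastport", "Idaho")))            # 3302: Eastport, Idaho
```

Returning None for any malformed input gives the caller a single condition to test before printing the usage statement.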
Second, we will add a second subcommand (compare) that accepts two flags, -d for diff_dates and -p for diff_ports. The first will take one port code and two dates, and the second will take two port codes and one date and pass those parameters to their respective functions. (Note that because the dates have spaces, you will need to pass them in quotes if you want them treated as a single entry in argv.) The results of either call should show the name of each measure, and then a +/- value for the amount. Sort the measures by their names. You can test your script in a notebook via the IPython magic command %run -m port_entries compare ... Some sample output:
>>> %run -m port_entries compare -d 103 "Jul 2023" "Aug 2023"
Bus Passengers: -8
Buses: +0
Pedestrians: -15
Personal Vehicle Passengers: -2251
Personal Vehicles: -1151
Truck Containers Empty: -12
Truck Containers Loaded: -39
Trucks: +32
>>> %run -m port_entries compare -p 2604 2608 "Jul 2023"
Bus Passengers: +19990
Buses: +771
Pedestrians: +112880
Personal Vehicle Passengers: +195843
Personal Vehicles: +55736
Rail Containers Empty: +1808
Rail Containers Loaded: +5102
Trains: +73
Truck Containers Empty: +5923
Truck Containers Loaded: +24983
Trucks: +19547
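The signed, sorted output shown above can be produced with the + flag of Python's format specification. This is a minimal sketch, assuming the differences are integers; the helper name format_diff is hypothetical, and the three values are taken from the first sample above.

```python
def format_diff(diff):
    """Return 'measure: +/-value' lines, sorted by measure name."""
    # The '+' flag forces a sign on every number, so zero prints as +0.
    return [f"{measure}: {value:+d}" for measure, value in sorted(diff.items())]


# Command-line arguments arrive as strings, so a port code needs converting.
port_code = int("103")

diff = {"Trucks": 32, "Buses": 0, "Bus Passengers": -8}
for line in format_diff(diff):
    print(line)
# Bus Passengers: -8
# Buses: +0
# Trucks: +32
```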
Hint: Arguments in sys.argv are strings and may need to be converted.
For this part, you will update the port_entries.compare module and add the ability to filter by measure. Create a new function filter_by_measure that will filter the returned items from the diff functions to only include those that match the specified measure(s). (Note that this could be integrated when doing the comparisons, but please do this as post-processing for this assignment.) Then, update the command-line program so that a user can add an optional argument -m followed by an expression representing measures to the compare subcommand. The expressions may contain literal characters but also the wildcard character (*), and that character should match any number of characters. You can use Python’s regular expression library to deal with wildcards, but you will need to change each * to .* (its regular-expression equivalent). The -m parameter can be before or after the -d or -p flags. Update the usage function to reflect this addition. Sample output:
>>> res = port_entries.compare.diff_dates(103, "Jan 2023", "Feb 2023")
>>> port_entries.compare.filter_by_measure(res, "*Containers*")
{'Truck Containers Empty': -25, 'Truck Containers Loaded': -8}
>>> %run -m port_entries compare -m "*Containers*" -d 103 "Jan 2023" "Feb 2023"
Truck Containers Empty: -25
Truck Containers Loaded: -8
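The wildcard translation and the position-independent -m flag can be sketched as below. Several choices here are assumptions rather than requirements: the sketch escapes the literal pieces with re.escape in addition to replacing * with .*, it requires the whole measure name to match (re.fullmatch), and the helper names wildcard_to_regex and split_measure_flag are hypothetical. The sample values come from the Jul/Aug 2023 output for port 103 shown earlier.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '*' wildcard expression into a compiled regular expression."""
    # Escape the literal pieces so characters like '.' or '(' match themselves,
    # then rejoin them with '.*' in place of each '*'.
    return re.compile(".*".join(re.escape(piece) for piece in pattern.split("*")))


def filter_by_measure(diff, pattern):
    """Keep only the measures whose whole name matches the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return {m: v for m, v in diff.items() if regex.fullmatch(m)}


def split_measure_flag(args):
    """Pull '-m <expr>' out of args wherever it appears; return (expr, remaining)."""
    if "-m" not in args:
        return None, list(args)
    i = args.index("-m")
    if i + 1 >= len(args):
        raise ValueError("-m requires an expression")
    return args[i + 1], args[:i] + args[i + 2:]


diff = {"Buses": 0, "Truck Containers Empty": -12,
        "Truck Containers Loaded": -39, "Trucks": 32}
print(filter_by_measure(diff, "*Containers*"))
# {'Truck Containers Empty': -12, 'Truck Containers Loaded': -39}
print(split_measure_flag(["-d", "103", "Jul 2023", "Aug 2023", "-m", "Truck*"]))
# ('Truck*', ['-d', '103', 'Jul 2023', 'Aug 2023'])
```

Removing the -m pair first leaves the remaining arguments in the same shape as before, so the existing -d and -p handling does not need to change.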
Hints: Check sys.argv to determine which flag is being used. See re.match and re.search for regular expressions.