The goal of this assignment is to work with scripts and packages in Python.
You will be doing your work in Python for this assignment. You may
choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local
installation of Jupyter and Python. You should use Python 3.10 for your
work, but earlier versions (>= 3.8) may also work. To use tiger, use
the credentials you received. If you work remotely, make sure to
download the .py files to turn in. If you choose to work locally, Anaconda is the easiest way
to install and manage Python. If you work locally, you may launch
Jupyter Lab either from the Navigator application or via the
command line as jupyter-lab.
In this assignment, we will again be working with data from the United States Department of Transportation’s Border Crossing Entry Data that we first used for Assignment 3. Recall that this dataset counts all traffic coming through the ports at the Canadian and Mexican borders. Rather than using this dataset directly, I have created a subset of this data, which can be read as a list of dictionaries. That data is located here, but you should download it via code (see Part 1a). Once loaded, the data is a list of dictionaries where each dictionary has seven key-value pairs. Those keys and a brief description are:
Port Name: a name for the port, often associated with its location
State: the state the port is located in
Port Code: a unique numeric identifier for the port
Border: the country whose border the port is for (Canada or Mexico)
Date: the month and year when the data was collected, as a string
Measure: the type of conveyance/container/person being counted
Value: the value of the measure for the specified month
You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.
The assignment is due at 11:59pm on Monday, October 24.
You should submit the completed Python files required for this
assignment on Blackboard. Zip the files together; the filename of the
zipfile should be a5.zip. You can create an archive on tiger (assuming
you created an a5 directory that contains the package and is your
current working directory) using the following code in a notebook:
import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')
Then, download the a5.zip file to turn in via Blackboard. Make sure
your archive contains all of the port_entries package.
Please make sure to follow instructions to receive full credit. To
test your code, you may use the %run magic command in the notebook.
For example,
%run -m port_entries.find -n Eastport
You may also use the Terminal in Jupyter on tiger, but you should make sure to activate the correct environment using conda:
$ conda activate py3.10
$ python -m port_entries.find -n Eastport
Since we are using Python (.py) files for this assignment, add
the identifying information to the __init__.py file of your
package. Minimally, you should have a line for your name and a line for
your Z-ID. If you wish to add other information (the assignment name, a
description of the assignment), you may do so after these two lines.
Create three new Python modules, one for reading the dataset, one for
finding ports, and one for comparing two ports. Put the three modules
(util.py, find.py, and compare.py) into a package named port_entries.
Create a util.py module that has three functions: download_data,
get_data, and parse_data.
The download_data function should download the border-crossing.json
datafile and store it locally in a file named border-crossing.json. The
parse_data function should parse the json data (list of dictionaries)
into a new data structure that organizes data by port. It should create
a dictionary that looks like:
{<port_code>: {
    name: <port_name>,
    border: <border>,
    state: <state>,
    monthly_data: { <date>: {<measure>: <value>, ...}, ...}
  },
  ...
}
In other words, port codes are keys in a dictionary whose values hold
data about the port and a monthly data dictionary whose keys are dates
and whose values are dictionaries that contain Measure-Value key-value
pairs (similar to Part 4 of Assignment 3). The get_data function should
either (a) load the data from the json file, parse it using parse_data,
store it in a local module variable, and return that value; or (b)
return the value of the already stored data (from the local variable).
Assume that the data file resides in the same directory as util.py. You
can then get its absolute path using the __file__ variable of the
module:
import os
fname = os.path.join(os.path.dirname(__file__), 'border-crossing.json')
Use the json module to load the data from the file. The download_data
function should download the file only if it is not already stored
locally; in either case, it should return the local filename. Refer to
the Assignment 3 starter notebook for code that can be used to download
data from http://faculty.cs.niu.edu/~dakoop/cs503-2022fa/assignments/a3/border-crossing.json.
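The download-once behavior can be sketched roughly as follows. This is only an illustration under assumptions: it uses urllib.request.urlretrieve rather than the code from the Assignment 3 starter notebook, and the url and fname parameters are hypothetical additions that make the function easy to try out.

```python
import os
import urllib.request

# URL copied from the assignment text. The default local filename matches
# the name used with os.path.join in the assignment (an assumption where
# the text is ambiguous).
DATA_URL = ("http://faculty.cs.niu.edu/~dakoop/cs503-2022fa/"
            "assignments/a3/border-crossing.json")


def download_data(url=DATA_URL, fname="border-crossing.json"):
    """Download url to fname unless fname already exists; return fname."""
    if not os.path.exists(fname):
        # Reached only when no local copy exists yet.
        urllib.request.urlretrieve(url, fname)
    return fname
```

Calling download_data() a second time skips the network request because the file is already on disk.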
The get_data function should load and parse the file from disk once,
otherwise returning the pre-loaded data. The parse_data function should
parse the port entry data as specified above.
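One possible shape for the regrouping and the load-once cache is sketched below. It is not the required solution: the string keys "name", "border", "state", and "monthly_data" are an assumed reading of the structure above, and the fname parameter is a hypothetical convenience (the assignment derives the path from __file__).

```python
import json

_data = None  # module-level cache, filled on the first get_data() call


def parse_data(records):
    """Regroup a list of flat records into a dict keyed by port code."""
    ports = {}
    for rec in records:
        # Create the port entry the first time this port code is seen.
        port = ports.setdefault(rec["Port Code"], {
            "name": rec["Port Name"],
            "border": rec["Border"],
            "state": rec["State"],
            "monthly_data": {},
        })
        # Create the month entry on demand, then record the measure.
        month = port["monthly_data"].setdefault(rec["Date"], {})
        month[rec["Measure"]] = rec["Value"]
    return ports


def get_data(fname="border-crossing.json"):
    """Load and parse fname once; later calls return the cached result."""
    global _data
    if _data is None:
        with open(fname) as f:
            _data = parse_data(json.load(f))
    return _data
```

Because the cache lives in a module variable, reloading the module (for example with %autoreload) resets it and forces the file to be read again.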
Hint: use %autoreload to automatically reload modules as you edit them.
Do note, however, that this will mask the effects of trying to not keep
reloading the data! You can also use importlib.reload to do this
manually.
Create a find.py module that has two functions, ports_by_state and
ports_by_name. Both functions should return the ports that match the
given name and/or state. ports_by_state should take one parameter, the
state name, while ports_by_name should take two parameters, the port
name and the state. However, the state is optional, so if the user does
not provide the state, the function should search across all states.
Each function should return a list of tuples of the form
(<port_code>, <port_name>, <state>). For example,
>>> port_entries.find.ports_by_state('Idaho')
[(3302, 'Eastport', 'Idaho'), (3308, 'Porthill', 'Idaho')]
>>> port_entries.find.ports_by_name('Eastport')
[(3302, 'Eastport', 'Idaho'), (103, 'Eastport', 'Maine')]
>>> port_entries.find.ports_by_name('Eastport', state='Idaho')
[(3302, 'Eastport', 'Idaho')]
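A minimal sketch of how an optional state parameter might work is shown here. It assumes the parsed structure uses the string keys "name" and "state", and it takes the parsed dictionary as an explicit data argument so the block is self-contained; the functions in the assignment instead take only the name and state.

```python
def ports_by_state(data, state):
    """Return (code, name, state) tuples for every port in state."""
    return [(code, port["name"], port["state"])
            for code, port in data.items()
            if port["state"] == state]


def ports_by_name(data, name, state=None):
    """Return matching ports; state=None searches across all states."""
    return [(code, port["name"], port["state"])
            for code, port in data.items()
            if port["name"] == name
            and (state is None or port["state"] == state)]


# Tiny stand-in for the parsed data, using the ports from the examples.
sample = {
    3302: {"name": "Eastport", "state": "Idaho"},
    3308: {"name": "Porthill", "state": "Idaho"},
    103: {"name": "Eastport", "state": "Maine"},
}
```

Using None as the default lets a caller omit the state entirely or pass it by keyword, as in the examples.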
ports_by_name
Create a compare.py module that calculates comparative information
between two dates for a given port, or between two ports for a given
date. Given a port code and two date strings as parameters, the
diff_dates function should return the difference in values for each
measure between the two months. Given two port codes and a date string
as parameters, the diff_ports function should return the difference
between ports in values for each measure. When one date/port does not
have a value for the given measure, assume its value is zero (0).
Examples (Updated 2022-10-20):
>>> port_entries.compare.diff_dates(3302, "Jul 2021", "Aug 2021")
{'Trucks': -129,
'Train Passengers': -10,
'Truck Containers Loaded': -335,
'Trains': 4,
'Personal Vehicle Passengers': -2043,
'Truck Containers Empty': 186,
'Personal Vehicles': -1018,
'Rail Containers Empty': -101,
'Rail Containers Loaded': 723}
>>> port_entries.compare.diff_ports(2604, 2608, "Jul 2021")
{'Train Passengers': 416,
'Truck Containers Loaded': 11083,
'Truck Containers Empty': 6064,
'Rail Containers Loaded': 2427,
'Bus Passengers': 21793,
'Trucks': 17073,
'Buses': 455,
'Personal Vehicle Passengers': 3663,
'Trains': 49,
'Personal Vehicles': -25822,
'Rail Containers Empty': 2567,
'Pedestrians': 2971}
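The zero-default rule can be illustrated with a small helper that compares two Measure-Value dictionaries. The helper name diff_measures and the sign convention (first minus second) are assumptions, so check the direction against the examples above before relying on it.

```python
def diff_measures(first, second):
    """Return first[m] - second[m] for every measure in either dict.

    A measure missing from one side counts as 0 on that side.
    """
    measures = set(first) | set(second)
    return {m: first.get(m, 0) - second.get(m, 0) for m in measures}
```

Taking the union of the keys ensures that a measure reported for only one date or port still appears in the result.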
Hint: dictionaries have a get method that allows you to provide a
default value if a key does not exist.
Make sure all three analysis modules live in a single port_entries
package. Add an __init__.py file for completeness. It may contain
documentation and the pass keyword.
Now, we will create two command-line programs that live in the find.py
and compare.py modules we just created. These command-line programs
will use the functions but produce more readable output. We will run
these programs via Python’s -m functionality. Thus, each of find.py and
compare.py should be usable as a module and as a script. Use __name__
to check which case is being used.
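The usual pattern for a file that works both as a module and as a script looks roughly like this. The main function and its return codes are illustrative choices, not something the assignment prescribes.

```python
import sys


def main(argv):
    """Entry point for script use; argv excludes the program name."""
    if not argv:
        print("no arguments given")
        return 1
    print("arguments:", argv)
    return 0


# __name__ is "__main__" when the file is run with python -m or %run -m,
# and the dotted module name (e.g. "port_entries.find") when imported.
if __name__ == "__main__":
    main(sys.argv[1:])
```

Importing the module therefore defines main without running it, while python -m executes the guarded call.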
port_entries.find (15 pts)
In this module, we will create a command-line program that accepts two
flags, -n for ports_by_name searches, and -s for ports_by_state
searches. Each flag is followed by a single string argument. That
string should be passed to the respective function, but the output
should be displayed in a more succinct manner. Each data item should be
listed as <port_code>: <port_name>, <state>. Make sure to have a usage
statement that is shown when a user enters incorrect input. You should
test your script via the IPython magic command
%run -m port_entries.find ...
Some sample output:
>>> %run -m port_entries.find
Usage: python -m port_entries.find [-n <port-name>] [-s <state>]
>>> %run -m port_entries.find -s Idaho
3302: Eastport, Idaho
3308: Porthill, Idaho
>>> %run -m port_entries.find -n "Eastport"
3302: Eastport, Idaho
103: Eastport, Maine
>>> %run -m port_entries.find -n "Eastport" -s Idaho
3302: Eastport, Idaho
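One way to read the two flags by hand and produce the succinct output format is sketched below. The helper names parse_find_args and format_port are hypothetical, and returning None to signal "show the usage statement" is just one possible convention.

```python
def parse_find_args(argv):
    """Return (name, state) from argv, or None if the input is invalid."""
    name = state = None
    i = 0
    while i < len(argv):
        if argv[i] in ("-n", "-s") and i + 1 < len(argv):
            if argv[i] == "-n":
                name = argv[i + 1]
            else:
                state = argv[i + 1]
            i += 2  # skip the flag and its value
        else:
            return None  # unknown flag or missing value
    if name is None and state is None:
        return None  # no flags at all: caller should print usage
    return name, state


def format_port(port):
    """Format a (code, name, state) tuple as '<code>: <name>, <state>'."""
    code, name, state = port
    return f"{code}: {name}, {state}"
```

With this split, the script portion only has to decide which search function to call and print one formatted line per result.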
Hint: check __name__ to see if we are running the module as a script.
port_entries.compare (15 pts)
In this module, we will create a command-line program that accepts two
flags, -d for diff_dates and -p for diff_ports. The first will take one
port code and two dates, and the second will take two port codes and
one date; each passes those parameters to its respective function.
(Note that because the dates have spaces, you will need to pass them in
quotes if you want them treated as a single entry in argv.) The results
of either call should show the name of each measure, and then a +/-
value for the amount. Sort the measures by their names. You can test
your script via the IPython magic command
%run -m port_entries.compare ...
Some sample output (Updated 2022-10-20):
>>> %run -m port_entries.compare -d 103 "Jul 2021" "Aug 2021"
Pedestrians: -8
Personal Vehicle Passengers: -984
Personal Vehicles: -509
Truck Containers Empty: +5
Truck Containers Loaded: +0
Trucks: +4
>>> %run -m port_entries.compare -p 2604 2608 "Jul 2021"
Bus Passengers: +21793
Buses: +455
Pedestrians: +2971
Personal Vehicle Passengers: +3663
Personal Vehicles: -25822
Rail Containers Empty: +2567
Rail Containers Loaded: +2427
Train Passengers: +416
Trains: +49
Truck Containers Empty: +6064
Truck Containers Loaded: +11083
Trucks: +17073
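The signed, name-sorted output can be produced with the "+" format specifier, which prints an explicit sign for zero and positive integers as well as negative ones. The helper name format_diff is hypothetical, and the sketch assumes all values are integers.

```python
def format_diff(diff):
    """Return 'Measure: +N' lines sorted by measure name."""
    return [f"{measure}: {value:+d}"
            for measure, value in sorted(diff.items())]


# A few values taken from the first sample run above.
for line in format_diff({"Trucks": 4,
                         "Truck Containers Loaded": 0,
                         "Pedestrians": -8}):
    print(line)
```

Sorting the (measure, value) pairs orders them by measure name because the names are unique dictionary keys.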
Hints: the arguments in sys.argv are strings and may need to be
converted. Check __name__ to see if we are running the module as a
script.
For this part, you will update the port_entries.compare module and add
the ability to filter by measure. Create a new function
filter_by_measure that will filter the returned items from the diff
functions to only include those that match the specified measure(s).
(Note that this could be integrated when doing the comparisons, but
please do this as post-processing for this assignment.) Then, update
the command-line program so that a user can add an optional argument -m
followed by an expression representing measures. The expressions may
contain literal characters but also the wildcard character (*), and
that character should match any number of characters. You can use
Python’s regular expression library to deal with wildcards, but you
will need to change any instances of * to .* in the expression. The -m
parameter can be before or after the -d or -p flags. Update the usage
function to reflect this addition. Sample Output:
>>> res = port_entries.compare.diff_dates(103, "Jan 2021", "Feb 2021")
>>> filter_by_measure(res, "*Containers*")
{'Truck Containers Loaded': 19, 'Truck Containers Empty': 21}
>>> %run -m port_entries.compare -m "*Containers*" -d 103 "Jan 2021" "Feb 2021"
Truck Containers Empty: +21
Truck Containers Loaded: +19
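A possible way to turn the wildcard expression into a regular expression is sketched here. Two choices are assumptions rather than requirements: the literal parts are escaped with re.escape so other regex metacharacters are treated as plain text, and re.fullmatch is used so the expression must cover the whole measure name.

```python
import re


def wildcard_to_regex(pattern):
    """Translate '*' wildcards into '.*', treating the rest literally."""
    # re.escape turns '*' into '\*'; swap those back for '.*'.
    return re.escape(pattern).replace(r"\*", ".*")


def filter_by_measure(diff, pattern):
    """Keep only the measures whose full name matches pattern."""
    regex = re.compile(wildcard_to_regex(pattern))
    return {m: v for m, v in diff.items() if regex.fullmatch(m)}
```

With re.match or re.search instead of re.fullmatch, an expression such as Truck would also match longer names like Trucks, so the choice of function changes which measures pass the filter.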
Hints: check sys.argv to determine which flag is being used. See
re.match and re.search for regular expressions.
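Since -m may appear before or after the -d or -p flags, one option is to pull it out of the argument list first and parse the remainder as before. The helper name split_measure_flag is hypothetical, and raising ValueError for a missing expression is only one way to trigger the usage message.

```python
def split_measure_flag(argv):
    """Return (pattern, remaining_args); pattern is None without -m."""
    args = list(argv)  # copy so the caller's list is left untouched
    if "-m" not in args:
        return None, args
    i = args.index("-m")
    if i + 1 >= len(args):
        raise ValueError("-m requires a measure expression")
    pattern = args[i + 1]
    del args[i:i + 2]  # drop the flag and its value
    return pattern, args
```

After this step, the remaining arguments have the same shape as in the earlier version of the program, so the existing -d and -p handling can stay as it is.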