Assignment 5

Goals

The goal of this assignment is to work with scripts and packages in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.12 for your work. (Older versions may work, but your code will be checked with Python 3.12.) To use tiger, use the credentials you received. If you work remotely, make sure to download the .py files to turn in. If you choose to work locally, Anaconda or miniforge are probably the easiest ways to install and manage Python. You will probably find it useful to create a notebook to test your Python package, but you are not required to turn that in.

In this assignment, we will again be working with data from the United States Department of Transportation’s Border Crossing Entry Data that we first used for Assignment 3. Recall that this dataset counts all traffic coming through the ports at the Canadian and Mexican borders. Rather than using this dataset directly, I have created a subset of this data, which can be read as a list of dictionaries. That data is available online, but you should download it via code (see Part 1a). Once loaded, the data is a list of dictionaries where each dictionary has seven key-value pairs. Those keys and a brief description are:

Port Name: a name for the port, often associated with its location
State: the state the port is located in
Port Code: a unique numeric identifier for the port
Border: the country whose border the port is for (Canada or Mexico)
Date: the month and year when the data was collected, as a string
Measure: the type of conveyance/container/person being counted
Value: the value of the measure for the specified month

You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.

Due Date

The assignment is due at 11:59pm on Friday, March 21.

Submission

You should submit the completed Python files required for this assignment on Blackboard. Zip the files together; the filename of the zipfile should be a5.zip. You can create an archive on tiger (assuming you created an a5 directory above the package that is your current working directory) using the following code in a notebook:

import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')

Then, download the a5.zip file to turn in via Blackboard. Make sure your archive contains all of the port_entries package.

Details

Please make sure to follow instructions to receive full credit. To test your code, you may use the %run magic command in the notebook. For example,

%run -m port_entries find -n Eastport

You may also use the Terminal in Jupyter on tiger, but you should make sure to activate the correct environment using conda:

$ conda activate py3.12
$ python -m port_entries find -n Eastport

0. Name & Z-ID (5 pts)

Since we are using Python files (.py) for this assignment, add the identifying information to the __init__.py file of your package. Minimally, you should have a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Port Entries Package

Create three new Python modules, one for reading the dataset, one for finding ports, and one for comparing two ports. Put the three modules (util.py, find.py, and compare.py) into a package named port_entries.

1a. Data Utilities (20 pts)

Create a util.py module that has three methods: download_data, get_data, and parse_data.

The download_data method should download the border-crossing.json datafile and store it locally as a file named border-crossing.json. The parse_data method should parse the json data (list of dictionaries) into a new data structure that organizes data by port. It should create a dictionary that looks like:

{<port_code>: {
  name: <port_name>,
  border: <border>,
  state: <state>,
  monthly_data: { <date>: {<measure>: <value>, ...}, ...}
  },
  ...
}

In other words, port codes are keys in a dictionary whose values include data about the port and a monthly data dictionary whose keys are dates and whose values are dictionaries that contain Measure-Value key-value pairs (similar to Part 4 of Assignment 3). The get_data method should, if the data has already been loaded and parsed, return the already stored data (from the module variable); if not, it should load the data from the json file, parse it using parse_data, store it in a module variable, and return that value. Assume that the data file resides in the same directory as util.py. You can then get its path from the __file__ variable of the module via:

import os
fname = os.path.join(os.path.dirname(__file__),'border-crossing.json')

Use the json module to load the data from the file. The download_data method should download the file only if it has not already been downloaded, and should return the local filename in either case. Refer to the Assignment 3 starter notebook for code that can be used to download data from https://gist.githubusercontent.com/dakoop/2d80cefa399f1926a9a6d16f7d5d8757/raw/d137113184043b465c27f259e5b86f931fb133b2/border-crossing.json. The get_data method should load and parse the file from disk once, returning the pre-loaded data on later calls. The parse_data method should parse the port entry data as specified above.
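
The download-once and parsing steps described above might be sketched as follows. This is only one possible approach, not a required implementation. It assumes that urllib.request.urlretrieve from the standard library is acceptable for the download, that the url and fname parameters are passed in (they are added here for illustration; your download_data may build them itself), and that the keys name, border, state, and monthly_data in the target structure are plain strings.

```python
import os
from urllib.request import urlretrieve


def download_data(url, fname):
    """Fetch url into fname only if fname does not exist yet; return fname."""
    # Assumption: "already downloaded" is detected by the file existing on disk.
    if not os.path.exists(fname):
        urlretrieve(url, fname)
    return fname


def parse_data(records):
    """Regroup a flat list of record dicts into a dict keyed by port code."""
    ports = {}
    for rec in records:
        # setdefault creates the port entry the first time a port code is seen
        port = ports.setdefault(rec["Port Code"], {
            "name": rec["Port Name"],
            "border": rec["Border"],
            "state": rec["State"],
            "monthly_data": {},
        })
        # one inner dict of measure -> value for each date
        by_measure = port["monthly_data"].setdefault(rec["Date"], {})
        by_measure[rec["Measure"]] = rec["Value"]
    return ports
```

Because download_data checks for the file before fetching, calling it repeatedly does not touch the network after the first successful call.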

Hints

Initialize the module variable to a sentinel value to indicate when the data has not been read.
Consider testing the functions via code in a notebook. You may also do this in the modules themselves, but remember to make sure they only run when the module is run as a script.
If you are using a notebook to test the module, remember that Python will not automatically reload changes to the module after its initial import. You can use %autoreload to automatically reload modules as you edit them. Note, however, that this will mask the effects of trying to not keep reloading the data! You can also use importlib.reload to reload modules manually.
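
The sentinel hint above can be illustrated with a small sketch. It is simplified on purpose: the fname parameter and the name _DATA are assumptions made for this example, and the parsing step is left out so that only the load-once pattern is shown.

```python
import json

_DATA = None  # sentinel: None means "nothing has been loaded yet"


def get_data(fname):
    """Load the JSON file on the first call; reuse the stored value afterwards."""
    global _DATA
    if _DATA is None:
        with open(fname) as f:
            _DATA = json.load(f)
    return _DATA
```

Reloading the module (for example with importlib.reload) re-executes the assignment _DATA = None, which is why automatic reloading hides whether the cache is working.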

1b. Finding Items (15 pts)

Create a find.py module that has two functions, ports_by_state and ports_by_name. Both functions should return the ports that match the given name and/or state. ports_by_state should take one parameter, the state name, while ports_by_name should take two parameters, the port name and the state. However, the state is optional, so if the user does not provide the state, the function should search across all states. Each function should return a list of tuples of the form (<port_code>, <port_name>, <state>). For example,

>>> port_entries.find.ports_by_state('Idaho')
[(3308, 'Porthill', 'Idaho'), (3302, 'Eastport', 'Idaho')]

>>> port_entries.find.ports_by_name('Eastport')
[(3302, 'Eastport', 'Idaho'), (103, 'Eastport', 'Maine')]

>>> port_entries.find.ports_by_name('Eastport', state='Idaho')
[(3302, 'Eastport', 'Idaho')]

Hints

Make sure to import the util sub-module! You might consider using relative imports to do this from a sibling module.
Think about what the default value for state should be for ports_by_name.
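
One way to handle the optional state is sketched below. To keep the example self-contained, the parsed data is passed in as an explicit data parameter, which is an assumption of this sketch; in the package, the functions would instead obtain it from the util module (for example after a relative import such as "from . import util"). Exact, case-sensitive matching is also an assumption.

```python
def ports_by_name(data, name, state=None):
    """Return (code, name, state) tuples for ports with the given name.

    When state is None (the default), ports in every state are considered.
    """
    return [
        (code, port["name"], port["state"])
        for code, port in data.items()
        if port["name"] == name and (state is None or port["state"] == state)
    ]


def ports_by_state(data, state):
    """Return (code, name, state) tuples for every port in the given state."""
    return [
        (code, port["name"], port["state"])
        for code, port in data.items()
        if port["state"] == state
    ]
```

Using None as the default makes "no state given" easy to distinguish from any real state name.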

1c. Comparison (15 pts)

Create a compare.py module that calculates comparative information between two dates for a given port, or between two ports for a given date. Given a port code and two date strings as parameters, the diff_dates function should return the difference in values for each measure between the two months. Given two port codes and a date string as parameters, the diff_ports function should return the difference between ports in values for each measure. When one date/port does not have a value for the given measure, assume its value is zero (0).

Examples:

>>> port_entries.compare.diff_dates(3302, "Jul 2023", "Aug 2023")
{'Truck Containers Empty': -94,
 'Truck Containers Loaded': 433,
 'Rail Containers Loaded': 870,
 'Trucks': 332,
 'Rail Containers Empty': 118,
 'Bus Passengers': 356,
 'Personal Vehicles': 1077,
 'Buses': 10,
 'Train Passengers': 4,
 'Personal Vehicle Passengers': 2257,
 'Trains': 7}

>>> port_entries.compare.diff_ports(2604, 2608, "Jul 2023")
{'Truck Containers Empty': 5923,
 'Truck Containers Loaded': 24983,
 'Rail Containers Loaded': 5102,
 'Pedestrians': 112880,
 'Trucks': 19547,
 'Rail Containers Empty': 1808,
 'Personal Vehicles': 55736,
 'Bus Passengers': 19990,
 'Buses': 771,
 'Personal Vehicle Passengers': 195843,
 'Trains': 73}

Hints

Consider testing the functions via code in a notebook. You may also do this in the modules themselves, but remember to make sure they only run when the module is run as a script.
Remember that dict has a get method that allows you to provide a default value if a key does not exist.
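
The dict.get hint can be illustrated with a small helper. The name diff_measures is hypothetical, and the direction of the subtraction (first minus second) is an assumption of this sketch; check it against the example output above before relying on it.

```python
def diff_measures(first, second):
    """Return first - second for every measure that appears in either dict."""
    # The union of keys ensures a measure missing from one side is still reported.
    measures = set(first) | set(second)
    # get(m, 0) treats a missing measure as zero, as the assignment requires.
    return {m: first.get(m, 0) - second.get(m, 0) for m in measures}
```

Both diff_dates and diff_ports could look up the two measure dictionaries they need and then hand them to a helper like this one.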

1d. Package

Make sure all three analysis modules live in a single port_entries package. Add an __init__.py file for completeness. It may contain documentation (including your name and Z-ID) and the pass keyword.

2. Command-Line Program

Now, we will create one command-line program for our Python package that uses the find.py and compare.py modules we just created. The command-line program will use the functions but produce more readable output. We will run this program via Python’s -m functionality and subcommands.

2a. Program (5 pts)

Create a __main__.py file to store the command-line program. It will need to switch between the find and compare subcommands so you can start by writing code that examines the first argument and ensures that it is either “find” or “compare” and prints a usage statement if not. Check that your code is working by running the package:

>>> %run -m port_entries find

>>> %run -m port_entries calc
Usage: python -m port_entries <find | compare>
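
A minimal sketch of the subcommand check is shown below. The function name main and the use of a return code are assumptions made for illustration; the usage text is the one from the sample output above.

```python
USAGE = "Usage: python -m port_entries <find | compare>"


def main(argv):
    """Check the subcommand in argv[1]; return 0 if it is valid, 1 otherwise."""
    # argv[0] is the program name, so the subcommand (if any) is argv[1].
    if len(argv) < 2 or argv[1] not in ("find", "compare"):
        print(USAGE)
        return 1
    # Placeholder: a real program would hand argv[2:] to the chosen subcommand.
    return 0

# In __main__.py, this could be invoked as: main(sys.argv), after "import sys".
```

Keeping the logic in a function that takes argv as a parameter makes it possible to test from a notebook without going through %run.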

2b. Find subcommand (15 pts)

First, we will create the subcommand find that accepts two flags, -n for ports_by_name searches, and -s for ports_by_state searches. Each flag is followed by a single string argument. That string should be passed to the respective function, but the output should be displayed in a more succinct manner. Each data item should be listed as <port_code>: <port_name>, <state>. Make sure to have a usage statement that is shown when a user enters incorrect input. You should test your script via the IPython magic command %run -m port_entries find .... Some sample output:

>>> %run -m port_entries find
Usage: python -m port_entries find [-n <port-name>] [-s <state>]

>>> %run -m port_entries find -s Idaho
3302: Eastport, Idaho
3308: Porthill, Idaho

>>> %run -m port_entries find -n "Eastport"
3302: Eastport, Idaho
103: Eastport, Maine

>>> %run -m port_entries find -n "Eastport" -s Idaho
3302: Eastport, Idaho
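
The flag handling for find might be sketched as follows. The helper names parse_find_args and format_port are hypothetical, and the choice to signal bad input by returning None is an assumption of this sketch; args stands for the arguments that follow the find subcommand.

```python
def parse_find_args(args):
    """Turn e.g. ['-n', 'Eastport', '-s', 'Idaho'] into {'-n': 'Eastport', '-s': 'Idaho'}.

    Returns None when the arguments are malformed, so the caller can print usage.
    """
    if not args or len(args) % 2 != 0:
        return None
    opts = {}
    # Walk the list two items at a time: a flag followed by its value.
    for flag, value in zip(args[::2], args[1::2]):
        if flag not in ("-n", "-s") or flag in opts:
            return None
        opts[flag] = value
    return opts


def format_port(port):
    """Format a (code, name, state) tuple as '<port_code>: <port_name>, <state>'."""
    code, name, state = port
    return f"{code}: {name}, {state}"
```

With the options in a dictionary, the program can decide which search function to call based on which flags are present.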

2c. Compare subcommand (15 pts)

Next, we will add a second subcommand (compare) that accepts two flags, -d for diff_dates and -p for diff_ports. The first will take one port code and two dates, and the second will take two port codes and one date; each passes those parameters to its respective function. (Note that because the dates have spaces, you will need to pass them in quotes if you want them treated as a single entry in argv.) The results of either call should show the name of each measure, and then a +/- value for the amount. Sort the measures by their names. You can test your script in a notebook via the IPython magic command %run -m port_entries compare ...

Some sample output:

>>> %run -m port_entries compare -d 103 "Jul 2023" "Aug 2023"
             Bus Passengers:       -8
                      Buses:       +0
                Pedestrians:      -15
Personal Vehicle Passengers:    -2251
          Personal Vehicles:    -1151
     Truck Containers Empty:      -12
    Truck Containers Loaded:      -39
                     Trucks:      +32

>>> %run -m port_entries compare -p 2604 2608 "Jul 2023"
             Bus Passengers:   +19990
                      Buses:     +771
                Pedestrians:  +112880
Personal Vehicle Passengers:  +195843
          Personal Vehicles:   +55736
      Rail Containers Empty:    +1808
     Rail Containers Loaded:    +5102
                     Trains:      +73
     Truck Containers Empty:    +5923
    Truck Containers Loaded:   +24983
                     Trucks:   +19547

Hints

Remember all elements in sys.argv are strings and may need to be converted.
Consider creating an auxiliary function to retrieve the data items referenced by the port_code, date pair.
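
The signed, aligned output can be produced with Python's format specification, as in the sketch below. The helper name format_diff is hypothetical, and the column widths (27 for the measure name, 9 for the value) are inferred from the sample output rather than stated in the assignment.

```python
def format_diff(diffs, width=27):
    """Return one aligned '<measure>: <+/-value>' line per measure, sorted by name."""
    # '>' right-aligns; '+' forces a sign on non-negative numbers as well.
    return [f"{measure:>{width}}:{diffs[measure]:>+9d}" for measure in sorted(diffs)]
```

The port codes arrive from sys.argv as strings, so they would need int(...) before being used as dictionary keys, while the quoted dates can be used as they are.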

3. [CSCI 503 Only] Add Measure Filtering (20 pts)

For this part, you will update the port_entries.compare module and add the ability to filter by measure. Create a new function filter_by_measure that will filter the returned items from the diff methods to only include those that match the specified measure(s). (Note that this could be integrated when doing the comparisons, but please do this as post-processing for this assignment.) Then, update the command-line program so that a user can add an optional argument -m followed by an expression representing measures to the compare subcommand. The expressions may contain literal characters but also the wildcard character (*), which should match any number of characters. You can use Python’s regular expression library to deal with wildcards, but you will need to change any instances of * to .*. The -m parameter can be before or after the -d or -p flags. Update the usage function to reflect this addition. Sample Output:

>>> res = port_entries.compare.diff_dates(103, "Jan 2023", "Feb 2023")
>>> port_entries.compare.filter_by_measure(res, "*Containers*")
{'Truck Containers Empty': -25, 'Truck Containers Loaded': -8}

>>> %run -m port_entries compare -m "*Containers*" -d 103 "Jan 2023" "Feb 2023"
     Truck Containers Empty:      -25
    Truck Containers Loaded:       -8

Hints

You’ll need to check particular indices of sys.argv to determine which flag is being used.
Remember the difference between re.match and re.search for regular expressions.
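
A possible sketch of the wildcard translation is given below. It makes two assumptions beyond the text: the expression must match the whole measure name (hence re.fullmatch), and the literal pieces are passed through re.escape so that characters with special meaning in regular expressions are matched literally.

```python
import re


def filter_by_measure(diffs, expr):
    """Keep only the measures whose full name matches the wildcard expression."""
    # Escape each literal piece, then rejoin the pieces with '.*' in place of '*'.
    pattern = ".*".join(re.escape(part) for part in expr.split("*"))
    return {m: v for m, v in diffs.items() if re.fullmatch(pattern, m)}
```

Under these assumptions, "Truck*" keeps every measure that starts with Truck, while a plain "Bus" matches nothing because no measure is named exactly Bus.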

Extra Credit

[15 pts] CSCI 490 Students may complete Part 3 for extra credit
[10 pts] Allow users to specify a year as a date and sum all the values from all of the months of that year
[10 pts] Modify the methods to support ranges of dates (e.g. Jun 2023-Oct 2023)