Assignment 5

Goals

The goal of this assignment is to work with scripts and packages in Python.

Instructions

You will be doing your work in Python for this assignment. You may choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.10 for your work, but earlier versions (>= 3.8) may also work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .py files to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or via the command-line as jupyter-lab.

In this assignment, we will again be working with data from the United States Department of Transportation’s Border Crossing Entry Data that we first used for Assignment 3. Recall that this dataset counts all traffic coming through the ports at the Canadian and Mexican borders. Rather than using this dataset directly, I have created a subset of this data, which can be read as a list of dictionaries. That data is located at http://faculty.cs.niu.edu/~dakoop/cs503-2022fa/assignments/a3/border-crossing.json, but you should download it via code (see Part 1a). Once loaded, the data is a list of dictionaries where each dictionary has seven key-value pairs. Those keys and a brief description are:

  • Port Name: a name for the port, often associated with its location
  • State: the state the port is located in
  • Port Code: a unique numeric identifier for the port
  • Border: the country whose border the port is for (Canada or Mexico)
  • Date: the month and year when the data was collected, as a string
  • Measure: the type of conveyance/container/person being counted
  • Value: the value of the measure for the specified month

You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.

Due Date

The assignment is due at 11:59pm on Monday, October 24.

Submission

You should submit the completed Python files required for this assignment on Blackboard. Zip the files together; the filename of the zipfile should be a5.zip. You can create an archive on tiger (assuming your current working directory is an a5 directory that contains the package) using the following code in a notebook:

import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')

Then, download the a5.zip file to turn in via Blackboard. Make sure your archive contains all of the port_entries package.

Details

Please make sure to follow instructions to receive full credit. To test your code, you may use the %run magic command in the notebook. For example,

%run -m port_entries.find -n Eastport 

You may also use the Terminal in Jupyter on tiger, but you should make sure to activate the correct environment using conda:

$ conda activate py3.10
$ python -m port_entries.find -n Eastport

0. Name & Z-ID (5 pts)

Since we are using Python (.py) files for this assignment, add the identifying information to the __init__.py file of your package. Minimally, you should have a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Port Entries Package

Create three new Python modules, one for reading the dataset, one for finding ports, and one for comparing two ports. Put the three modules (util.py, find.py, and compare.py) into a package named port_entries.

1a. Data Utilities (20 pts)

Create a util.py module that has three methods: download_data, get_data, and parse_data.

The download_data method should download the border-crossing.json datafile and store it locally as a file named border-crossing.json. The parse_data method should parse the json data (list of dictionaries) into a new data structure that organizes data by port. It should create a dictionary that looks like:

{<port_code>: {
  name: <port_name>,
  border: <border>,
  state: <state>,
  monthly_data: { <date>: {<measure>: <value>, ...}, ...}
  },
  ...
}

In other words, port codes are keys in a dictionary whose values hold data about the port, including a monthly data dictionary whose keys are dates and whose values are dictionaries that contain Measure-Value key-value pairs (similar to Part 4 of Assignment 3). The get_data method should either (a) load the data from the json file, parse it using parse_data, store it in a local module variable, and return that value; or (b) return the value of the already stored data (from the local variable). Assume that the data file resides in the same directory as util.py. You can then get its absolute path via the __file__ variable of the module via:

import os
fname = os.path.join(os.path.dirname(__file__),'border-crossing.json')

Use the json module to load the data from the file. The download_data method should download the file only if it does not already exist locally; in either case, it should return the local filename. Refer to the Assignment 3 starter notebook for code that can be used to download data from http://faculty.cs.niu.edu/~dakoop/cs503-2022fa/assignments/a3/border-crossing.json. The get_data method should load and parse the file from disk once, otherwise returning the pre-loaded data. The parse_data method should parse the port entry data as specified above.
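The target structure can be illustrated with a small sketch. This is only one possible approach, not the reference solution; the two sample records below are made up for illustration (the port "Testport" and code 9999 are hypothetical) and only follow the seven keys listed in the Instructions.

```python
def parse_data(records):
    """Group flat records by port code (a sketch, not the reference solution)."""
    ports = {}
    for rec in records:
        # Create the per-port entry the first time this port code is seen.
        port = ports.setdefault(rec["Port Code"], {
            "name": rec["Port Name"],
            "border": rec["Border"],
            "state": rec["State"],
            "monthly_data": {},
        })
        # One inner dictionary per date, mapping measure -> value.
        port["monthly_data"].setdefault(rec["Date"], {})[rec["Measure"]] = rec["Value"]
    return ports


# Made-up records for illustration only; not taken from the real dataset.
sample = [
    {"Port Name": "Testport", "State": "Idaho", "Port Code": 9999,
     "Border": "Canada", "Date": "Jan 2021", "Measure": "Trucks", "Value": 10},
    {"Port Name": "Testport", "State": "Idaho", "Port Code": 9999,
     "Border": "Canada", "Date": "Jan 2021", "Measure": "Buses", "Value": 2},
]
parsed = parse_data(sample)
print(parsed[9999]["monthly_data"])
```

Using setdefault keeps the function to a single pass over the records; an explicit "if key not in dict" check would work just as well.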

Hints
  • Initialize the module variable to a sentinel value to indicate when the data has not been read.
  • You can use %autoreload to automatically reload modules as you edit them. Do note, however, that this will mask the effects of trying to not keep reloading the data! You can also use importlib.reload to do this manually.
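The sentinel idea from the first hint might look like the following sketch. The variable name _DATA and the optional fname and parse parameters are assumptions made so the example runs on its own; in the actual module, get_data would call your parse_data and use the path built from __file__.

```python
import json
import os

_DATA = None  # sentinel: None means the file has not been loaded yet


def get_data(fname=None, parse=lambda records: records):
    """Load and parse the file on the first call; reuse the cached value afterwards."""
    global _DATA
    if _DATA is None:
        if fname is None:
            # Default location: next to this module, as described above.
            fname = os.path.join(os.path.dirname(__file__), "border-crossing.json")
        with open(fname) as f:
            _DATA = parse(json.load(f))
    return _DATA
```

After the first successful call, later calls return the same object without touching the disk, which is why reloading the module (and thereby resetting the sentinel) hides the effect.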

1b. Finding Items (15 pts)

Create a find.py module that has two functions, ports_by_state and ports_by_name. Both functions should return the ports that match the given name and/or state. ports_by_state should take one parameter, the state name, while ports_by_name should take two parameters, the port name and the state. However, the state is optional, so if the user does not provide the state, the function should search across all states. Each function should return a list of tuples of the form (<port_code>, <port_name>, <state>). For example,

>>> port_entries.find.ports_by_state('Idaho')
[(3302, 'Eastport', 'Idaho'), (3308, 'Porthill', 'Idaho')]

>>> port_entries.find.ports_by_name('Eastport')
[(3302, 'Eastport', 'Idaho'), (103, 'Eastport', 'Maine')]

>>> port_entries.find.ports_by_name('Eastport', state='Idaho')
[(3302, 'Eastport', 'Idaho')]

Hints
  • Make sure to import the util sub-module! You might consider using relative imports to do this from a sibling module.
  • Think about what the default value for state should be for ports_by_name.
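As a sketch of the optional-state idea, the version below takes the parsed dictionary as an explicit first argument so that it runs on its own; that extra parameter is an assumption for illustration. In the package, the function would instead obtain the data from the util module (for example via a relative import such as "from . import util"). The sample entries reuse the port codes, names, and states from the examples above and omit the other fields.

```python
def ports_by_name(ports, name, state=None):
    """Return (code, name, state) tuples; state=None means 'search all states'."""
    return [
        (code, info["name"], info["state"])
        for code, info in ports.items()
        if info["name"] == name and (state is None or info["state"] == state)
    ]


# Trimmed-down entries (no border or monthly_data) taken from the examples above.
ports = {
    3302: {"name": "Eastport", "state": "Idaho"},
    3308: {"name": "Porthill", "state": "Idaho"},
    103: {"name": "Eastport", "state": "Maine"},
}
print(ports_by_name(ports, "Eastport"))
print(ports_by_name(ports, "Eastport", state="Idaho"))
```

Using None as the default lets a single condition cover both the "any state" and the "specific state" cases.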

1c. Comparison (15 pts)

Create a compare.py module that calculates comparative information, either between two dates for a single port or between two ports for a single date. Given a port code and two date strings as parameters, the diff_dates function should return the difference in values for each measure between the two months. Given two port codes and a date string as parameters, the diff_ports function should return the difference between ports in values for each measure. When one date/port does not have a value for the given measure, assume its value is zero (0).

Examples (Updated 2022-10-20):

>>> port_entries.compare.diff_dates(3302, "Jul 2021", "Aug 2021")
{'Trucks': -129,
 'Train Passengers': -10,
 'Truck Containers Loaded': -335,
 'Trains': 4,
 'Personal Vehicle Passengers': -2043,
 'Truck Containers Empty': 186,
 'Personal Vehicles': -1018,
 'Rail Containers Empty': -101,
 'Rail Containers Loaded': 723}

>>> port_entries.compare.diff_ports(2604, 2608, "Jul 2021")
{'Train Passengers': 416,
 'Truck Containers Loaded': 11083,
 'Truck Containers Empty': 6064,
 'Rail Containers Loaded': 2427,
 'Bus Passengers': 21793,
 'Trucks': 17073,
 'Buses': 455,
 'Personal Vehicle Passengers': 3663,
 'Trains': 49,
 'Personal Vehicles': -25822,
 'Rail Containers Empty': 2567,
 'Pedestrians': 2971}

Hints
  • Consider testing the functions via code in a notebook. You may also do this in the modules themselves, but remember to make sure they only run when the module is run as a script.
  • Remember that dict has a get method that allows you to provide a default value if a key does not exist.
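The dict.get hint can be sketched with a small helper. The name diff_measures is hypothetical, and the direction of the subtraction (first minus second) is an assumption that you should check against the examples above; the input values are made up.

```python
def diff_measures(first, second):
    """Difference per measure, treating a missing measure as 0 (a sketch)."""
    # Union of the keys so measures present on only one side are still reported.
    measures = first.keys() | second.keys()
    return {m: first.get(m, 0) - second.get(m, 0) for m in measures}


# Made-up values for illustration only.
a = {"Trucks": 10, "Buses": 3}
b = {"Trucks": 4, "Pedestrians": 7}
result = diff_measures(a, b)
print(result)
```

Both diff_dates and diff_ports could share a helper like this once each has looked up the two measure dictionaries it needs.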

1d. Package

Make sure all three analysis modules live in a single port_entries package. Add an __init__.py file for completeness. It may contain documentation and the pass keyword.

2. Command-Line Programs

Now, we will create two command-line programs that live in the find.py and compare.py modules we just created. These command-line programs will use the functions but produce more readable output. We will run these programs via Python’s -m functionality. Thus, each of find.py and compare.py should be usable as a module and as a script. Use __name__ to check which case is being used.

2a. port_entries.find (15 pts)

In this module, we will create a command-line program that accepts two flags, -n for ports_by_name searches, and -s for ports_by_state searches. Each flag is followed by a single string argument. That string should be passed to the respective function, but the output should be displayed in a more succinct manner. Each data item should be listed as <port_code>: <port_name>, <state>. Make sure to have a usage statement that is shown when a user enters incorrect input. You should test your script via the IPython magic command %run -m port_entries.find .... Some sample output:

>>> %run -m port_entries.find
Usage: python -m port_entries.find [-n <port-name>] [-s <state>]

>>> %run -m port_entries.find -s Idaho
3302: Eastport, Idaho
3308: Porthill, Idaho

>>> %run -m port_entries.find -n "Eastport"
3302: Eastport, Idaho
103: Eastport, Maine

>>> %run -m port_entries.find -n "Eastport" -s Idaho
3302: Eastport, Idaho

Hints
  • Make sure to use the check for __name__ to see if we are running the module as a script.
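One way to combine the __name__ check with simple flag handling is sketched below. The helper names parse_args and main are hypothetical, and the search itself is replaced by a placeholder print; a real version would call ports_by_name or ports_by_state and print each result.

```python
import sys

USAGE = "Usage: python -m port_entries.find [-n <port-name>] [-s <state>]"


def parse_args(argv):
    """Return (name, state) from e.g. ['-n', 'Eastport', '-s', 'Idaho'], or None if invalid."""
    if not argv or len(argv) % 2 != 0:
        return None
    name = state = None
    # Walk the arguments as (flag, value) pairs.
    for flag, value in zip(argv[0::2], argv[1::2]):
        if flag == "-n":
            name = value
        elif flag == "-s":
            state = value
        else:
            return None
    return name, state


def main(argv):
    args = parse_args(argv)
    if args is None:
        print(USAGE)
        return
    name, state = args
    # Placeholder: the actual program would run the search and print the matches.
    print(f"would search name={name!r}, state={state!r}")


if __name__ == "__main__":
    # Runs only when executed as a script (python -m ...), not when imported.
    main(sys.argv[1:])
```

Keeping the parsing in its own function makes it easy to test from a notebook without going through the command line.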

2b. port_entries.compare (15 pts)

In this module, we will create a command-line program that accepts two flags, -d for diff_dates and -p for diff_ports. The first will take one port code and two dates, and the second will take two port codes and one date and pass those parameters to their respective functions. (Note that because the dates have spaces, you will need to pass them in quotes if you want them treated as a single entry in argv.) The results of either call should show the name of each measure, and then a +/- value for the amount. Sort the measures by their names. You can test your script via the IPython magic command %run -m port_entries.compare ...

Some sample output (Updated 2022-10-20):

>>> %run -m port_entries.compare -d 103 "Jul 2021" "Aug 2021"
                Pedestrians:       -8
Personal Vehicle Passengers:     -984
          Personal Vehicles:     -509
     Truck Containers Empty:       +5
    Truck Containers Loaded:       +0
                     Trucks:       +4

>>> %run -m port_entries.compare -p 2604 2608 "Jul 2021"
             Bus Passengers:   +21793
                      Buses:     +455
                Pedestrians:    +2971
Personal Vehicle Passengers:    +3663
          Personal Vehicles:   -25822
      Rail Containers Empty:    +2567
     Rail Containers Loaded:    +2427
           Train Passengers:     +416
                     Trains:      +49
     Truck Containers Empty:    +6064
    Truck Containers Loaded:   +11083
                     Trucks:   +17073

Hints
  • Remember all elements in sys.argv are strings and may need to be converted.
  • Consider creating an auxiliary function to retrieve the data items referenced by the port_code, date pair.
  • Make sure to use the check for __name__ to see if we are running the module as a script.
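The aligned, signed output can be produced with Python's format specifiers. The sketch below assumes a label width of 27 and a value width of 8, which are inferred from the sample output rather than stated requirements; format_diff is a hypothetical helper name, and the input values are taken from the Part 3 sample.

```python
def format_diff(diff, label_width=27, value_width=8):
    """Return one formatted line per measure, sorted by measure name (a sketch)."""
    return [
        # '>' right-aligns; '+' forces an explicit sign, so 0 prints as +0.
        f"{measure:>{label_width}}: {diff[measure]:>+{value_width}d}"
        for measure in sorted(diff)
    ]


lines = format_diff({"Truck Containers Loaded": 19, "Truck Containers Empty": 21})
for line in lines:
    print(line)
```

Returning the lines rather than printing inside the helper keeps the formatting easy to check in a notebook.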

3. [CSCI 503 Only] Add Measure Filtering (20 pts)

For this part, you will update the port_entries.compare module and add the ability to filter by measure. Create a new function filter_by_measure that will filter the returned items from the diff methods to only include those that match the specified measure(s). (Note that this could be integrated when doing the comparisons, but please do this as post-processing for this assignment.) Then, update the command-line program so that a user can add an optional argument -m followed by an expression representing measures. The expression may contain literal characters as well as the wildcard character (*), which should match any number of characters. You can use Python’s regular expression library to deal with wildcards, but you will need to change any instances of * to .*. The -m parameter can be before or after the -d or -p flags. Update the usage function to reflect this addition. Sample Output:

>>> res = port_entries.compare.diff_dates(103, "Jan 2021", "Feb 2021")
>>> port_entries.compare.filter_by_measure(res, "*Containers*")
{'Truck Containers Loaded': 19, 'Truck Containers Empty': 21}

>>> %run -m port_entries.compare -m "*Containers*" -d 103 "Jan 2021" "Feb 2021"
     Truck Containers Empty:      +21
    Truck Containers Loaded:      +19

Hints
  • You’ll need to check particular indices of sys.argv to determine which flag is being used.
  • Remember the difference between re.match and re.search for regular expressions.
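A minimal sketch of the wildcard translation follows. It escapes the literal pieces and joins them with .*, then uses re.fullmatch so that the whole measure name must match; that choice is one reasonable reading of the requirement, not the only one. The helper wildcard_to_regex is hypothetical, and the 'Trucks' value in the sample input is made up (the other two values come from the sample output above).

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern like '*Containers*' into a regular expression string."""
    # Escape literal text so characters such as '.' or '(' are not treated as regex syntax.
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def filter_by_measure(diff, pattern):
    """Keep only the measures whose full name matches the wildcard pattern."""
    regex = re.compile(wildcard_to_regex(pattern))
    return {m: v for m, v in diff.items() if regex.fullmatch(m)}


# 'Trucks': 5 is a made-up entry added to show that non-matching measures are dropped.
res = {"Truck Containers Loaded": 19, "Truck Containers Empty": 21, "Trucks": 5}
print(filter_by_measure(res, "*Containers*"))
```

With re.match, a pattern such as "Truck" would also accept "Trucks" because only the start of the string is anchored; re.fullmatch avoids that.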

Extra Credit

  • [15 pts] CSCI 490 students may complete Part 3 for extra credit
  • [10 pts] Allow users to specify a year as a date and sum all the values from all of the months of that year
  • [10 pts] Modify the methods to support ranges of dates (e.g. Jun 2021-Oct 2021)