Assignment 5

Goals

The goal of this assignment is to work with scripts and packages in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.12 for your work. (Older versions may work, but your code will be checked with Python 3.12.) To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda or miniforge are probably the easiest ways to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application (Anaconda) or from the command line as jupyter-lab or jupyter lab.

In this assignment, we will again be working with data from the United States Department of Agriculture’s FoodData Central that we first used for Assignment 3. That data is located at http://faculty.cs.niu.edu/~dakoop/cs503-2024fa/assignments/a3/food-data-sample.json. Once loaded, the data is a list of dictionaries where each dictionary has nine key-value pairs. Those keys and a brief description are:

  • fdc_id: a unique identifier assigned by FoodData Central
  • brand_owner: the company that makes the product
  • brand_name: a brand name, if different from the company
  • description: the product’s name or description
  • branded_food_category: the category for the food product
  • ingredients: a comma-separated string of ingredients in the product
  • serving_size: the serving size of the product in the units specified by serving_size_unit
  • serving_size_unit: the units for the serving size value
  • nutrition: a list of dictionaries containing nutrition information; each dictionary contains the keys name, amount, and unit_name and their associated values

You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.

Due Date

The assignment is due at 11:59pm on Monday, October 28.

Submission

You should submit the completed Python files required for this assignment on Blackboard. Zip the files together; the filename of the zipfile should be a5.zip. You can create an archive on tiger (assuming your current working directory is an a5 directory that contains the food_data package) using the following code in a notebook:

import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')

Then, download the a5.zip file to turn in via Blackboard. Make sure your archive contains all of the food_data package.
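
One way to confirm the archive contents before uploading is to list them with the standard zipfile module. The sketch below builds a throwaway archive in a temporary directory so that it runs anywhere; the helper name missing_from_archive and the EXPECTED list are illustrative assumptions based on the file names this assignment mentions, not a required part of the submission.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Files the assignment asks for, relative to the archive root (assumed layout).
EXPECTED = [
    "a5/food_data/__init__.py",
    "a5/food_data/util.py",
    "a5/food_data/find.py",
    "a5/food_data/compare.py",
]

def missing_from_archive(zip_path, expected=EXPECTED):
    """Return the expected paths that are not present in the zip file."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    return [name for name in expected if name not in names]

# Demonstration on a throwaway directory tree (not a real submission).
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "a5" / "food_data"
    pkg.mkdir(parents=True)
    # compare.py is left out on purpose so the check has something to report.
    for fname in ("__init__.py", "util.py", "find.py"):
        (pkg / fname).write_text("# placeholder\n")
    archive = shutil.make_archive(str(Path(tmp) / "a5"), "zip", tmp, "a5")
    demo_missing = missing_from_archive(archive)

print(demo_missing)
```

On a real submission you would call missing_from_archive('../a5.zip') after creating the archive and expect an empty list.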

Details

Please make sure to follow instructions to receive full credit. To test your code, you may use the %run magic command in the notebook. For example,

%run -m food_data.find -b "Red Gold" 

You may also use the Terminal in Jupyter on tiger, but you should activate the correct environment using conda:

$ conda activate py3.12
$ python -m food_data.find -b "Red Gold"

0. Name & Z-ID (5 pts)

Since we are using Python (.py) files for this assignment, add the identifying information to the __init__.py file of your package. Minimally, you should have a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Food Data Package

Create three new Python modules: one for reading the dataset, one for finding items by brand owner or description, and one for comparing two food items. Put the three modules (util.py, find.py, and compare.py) into a package named food_data.

1a. Data Utilities (15 pts)

Create a util.py module that has three methods: download_data, get_data, and parse_ingredients.

The download_data method should download the food-data-sample.json datafile and store it locally. The get_data method should load the data into a module variable. Assume that the data file resides in the same directory as util.py. You can then get its absolute path from the module’s __file__ variable:

from pathlib import Path
fname = Path(__file__).parent / 'food-data-sample.json'

Use the json module to load the data from the file. The download_data method should download the file just once, otherwise returning the local filename. Refer to the Assignment 3 starter notebook for code that can be used to download data from http://faculty.cs.niu.edu/~dakoop/cs503-2024fa/assignments/a3/food-data-sample.json. The get_data method should load and parse the file from disk once, otherwise returning the pre-loaded data. The parse_ingredients method should parse the ingredients string into a list of ingredients by splitting on a comma. Do not worry if this does not do a perfect job (e.g. you may see an ingredient labeled as “LESS THAN 2% OF: SALT”).

Hints
  • Initialize the module variable to a sentinel value to indicate when the data has not been read.
  • You can use %autoreload to automatically reload modules as you edit them. Do note, however, that this will mask the effects of trying to not keep reloading the data! You can also use importlib.reload to do this manually.
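
The load-once behavior and the sentinel hint above can be sketched as follows. To stay runnable without network access, the sketch writes a tiny stand-in JSON file to a temporary directory instead of downloading the real dataset. The names _data and DATA_FILE, the stand-in record, and the choice to strip whitespace around each ingredient are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the real data file; util.py would use Path(__file__).parent instead.
DATA_FILE = Path(tempfile.mkdtemp()) / "food-data-sample.json"
DATA_FILE.write_text(json.dumps([{"fdc_id": 1, "ingredients": "WATER, SALT"}]))

_data = None  # sentinel: None means the file has not been read yet

def get_data():
    """Load the JSON file on the first call; return the cached list afterwards."""
    global _data
    if _data is None:
        with open(DATA_FILE) as f:
            _data = json.load(f)
    return _data

def parse_ingredients(ingredients):
    """Split a comma-separated ingredients string and trim whitespace."""
    if ingredients is None:
        return []
    return [part.strip() for part in ingredients.split(",")]

first = get_data()
second = get_data()
print(first is second)  # the second call returns the same cached object
print(parse_ingredients(first[0]["ingredients"]))
```

The same "check the sentinel, then do the work" shape can be applied to download_data by testing whether the local file already exists before fetching it.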

1b. Finding Items (15 pts)

Create a find.py module that has two functions. Each takes one parameter (the brand owner or the description expression, respectively) and returns the list of data items that match. Use the get_data method from the util module to obtain the data. The first method, get_by_brand, should return a list of the items filtered by an exact match to brand owner, ignoring case. The second method, get_by_description, should take a string and search all item descriptions for those that match the given expression. The expression may contain literal characters but also the wildcard character (*), and that character should match any number of characters. You can use Python’s regular expression library to deal with wildcards, but you will need to change any instances of * to .*. Your regular expression should ignore case. For example, consider the output when searching for "Quaker*Oat":

>>> [d['description'] for d in food_data.find.get_by_description("Quaker*Oat")]
['Quaker Oatmeal to Go Apple Cinnamon (6 - 2.1 Ounce) 12.6 Ounce 6 Pack Plastic Bag',
 'Quaker Oatmeal Squares Golden Maple 14.5 Ounce Paper Box',
 'QUAKER FRUIT & OATMEAL FRTFLLEDBR BERRY   10.4Z']

Hints
  • Make sure to import the util module! You might consider using relative imports to do this from a sibling module.
  • You may need to check whether a particular attribute is None.
  • Remember the difference between re.match and re.search for regular expressions.
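
A minimal sketch of the wildcard handling, assuming a hypothetical helper named wildcard_to_regex. It goes one step beyond the plain * to .* substitution by escaping the expression first with re.escape, so that characters such as . or ( in a search string stay literal; that extra step is a design choice here, not something the assignment requires. The descriptions are taken from the sample output in this handout.

```python
import re

def wildcard_to_regex(expression):
    """Turn a wildcard expression such as 'Quaker*Oat' into a compiled regex."""
    # Escape everything first so regex metacharacters stay literal,
    # then turn the escaped '*' back into the regex '.*'.
    pattern = re.escape(expression).replace(r"\*", ".*")
    return re.compile(pattern, re.IGNORECASE)

descriptions = [
    "Quaker Oatmeal Squares Golden Maple 14.5 Ounce Paper Box",
    "QUAKER FRUIT & OATMEAL FRTFLLEDBR BERRY   10.4Z",
    "BALSAMIC VINAIGRETTE DRESSING, BALSAMIC VINAIGRETTE",
    None,  # some items may have no description
]

regex = wildcard_to_regex("Quaker*Oat")
matches = [d for d in descriptions if d is not None and regex.search(d)]
print(matches)

# re.match only succeeds at the start of the string; re.search scans all of it.
oat = wildcard_to_regex("Oat")
print(oat.match(descriptions[0]) is None)
print(oat.search(descriptions[0]) is not None)
```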

1c. Comparison (15 pts)

Create a compare.py module that calculates comparative information between two food items. Given two food items’ fdc_ids as parameters, the diff_nutrition function should return the difference between the nutrition values of the two items, and the diff_ingredients function should return the difference between the ingredients in the items. The diff_nutrition function should also include the difference in serving size; it should return a list of tuples of the form (<name>, <amount>, <unit_name>). Make sure the differences are between the same measures! The diff_ingredients function should parse each ingredient list via the parse_ingredients method in the util module from Part 1a, and then create three different sets: one for shared ingredients, one for ingredients in the first item and not the second, and one for ingredients in the second item and not the first.
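
One possible way to line up "the same measures" is to index one item's nutrition entries by name before subtracting. The sketch below works on two already-retrieved item dictionaries rather than fdc_ids. The function name diff_nutrition_items, the made-up toy values, the first-minus-second direction, and the decision to skip nutrients that appear in only one item (or with different units) are all assumptions made for illustration.

```python
def diff_nutrition_items(item1, item2):
    """Return (name, amount, unit) differences, first item minus second."""
    diffs = [(
        "Serving Size",
        item1["serving_size"] - item2["serving_size"],
        item1["serving_size_unit"],
    )]
    # Index the second item's nutrients by name so the same measures line up.
    by_name = {n["name"]: n for n in item2["nutrition"]}
    for n in item1["nutrition"]:
        other = by_name.get(n["name"])
        if other is not None and other["unit_name"] == n["unit_name"]:
            diffs.append((n["name"], n["amount"] - other["amount"], n["unit_name"]))
    return diffs

# Toy items with invented values, shaped like the dataset's dictionaries.
item_a = {
    "serving_size": 30.0, "serving_size_unit": "g",
    "nutrition": [
        {"name": "Sugar", "amount": 4.0, "unit_name": "G"},
        {"name": "Sodium", "amount": 300.0, "unit_name": "MG"},
    ],
}
item_b = {
    "serving_size": 28.0, "serving_size_unit": "g",
    "nutrition": [
        {"name": "Sodium", "amount": 250.0, "unit_name": "MG"},
        {"name": "Sugar", "amount": 1.5, "unit_name": "G"},
        {"name": "Fiber", "amount": 1.0, "unit_name": "G"},
    ],
}

result = diff_nutrition_items(item_a, item_b)
print(result)
```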

Examples:

>>> food_data.compare.diff_nutrition(622876, 1120164)
[('Serving Size', 2.0, 'g'),
 ('Saturated Fat', -3.7799999999999994, 'G'),
 ('Calories', -193.0, 'KCAL'),
 ('Total Fat', -25.91, 'G'),
 ('Sugar', 12.9, 'G'),
 ('Carbohydrates', 9.45, 'G'),
 ('Protein', -3.45, 'G'),
 ('Sodium', 448.0, 'MG'),
 ('Fiber', 0.0, 'G')]

>>> food_data.compare.diff_ingredients(622876, 1120164) 
({'CALCIUM DISODIUM EDTA ADDED TO PROTECT FLAVOR',
  'DISTILLED VINEGAR',
  'LEMON JUICE CONCENTRATE',
  'SALT',
  'SOYBEAN OIL',
  'WATER',
  'XANTHAN GUM'},
 {'ANCHOVIES',
  'CELERY SEED',
  'CHEESE CULTURE',
  'CORN SYRUP',
  'ENZYMES)',
  'GARLIC*',
  'MUSTARD SEED',
  'NATURAL FLAVOR',
  'ONION*',
  'PARMESAN CHEESE (PART-SKIM MILK',
  'POTASSIUM SORBATE (ADDED TO MAINTAIN FRESHNESS)',
  'RED WINE VINEGAR',
  'SPICE',
  'TAMARIND.'},
 {'ANNATTO COLOR.',
  'BALSAMIC VINEGAR (PRESERVED WITH SULFITES)',
  'BEET JUICE CONCENTRATE',
  'CARAMEL COLOR',
  'DEHYDRATED GARLIC',
  'HIGH FRUCTOSE CORN SYRUP',
  'SPICES',
  'SUGAR'})

Hints
  • Consider testing the functions via code in a notebook. You may also do this in the modules themselves, but remember to make sure they only run when the module is run as a script.
  • Remember that Python has set operators that help determine which items are shared or in one set and not the other.
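
The set operators mentioned in the hint can be sketched as below. The two ingredient strings are shortened stand-ins rather than real dataset records, and split_ingredients is a hypothetical stand-in for parse_ingredients from Part 1a.

```python
def split_ingredients(text):
    """Stand-in for util.parse_ingredients: split on commas and trim."""
    return [part.strip() for part in text.split(",")]

first = set(split_ingredients("WATER, SOYBEAN OIL, SUGAR, SALT"))
second = set(split_ingredients("WATER, SOYBEAN OIL, SALT, GARLIC*, SPICE"))

shared = first & second        # intersection: in both items
only_first = first - second    # difference: in the first item only
only_second = second - first   # difference: in the second item only

# Sets are unordered, so sort before printing to get stable output.
print(sorted(shared))
print(sorted(only_first))
print(sorted(only_second))
```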

1d. Package

Make sure all three analysis modules live in a single food_data package. Add an __init__.py file for completeness. It may contain documentation and the pass keyword.

2. Command-Line Programs

Now, we will create two command-line programs that live in the find.py and compare.py modules we just created. These command-line programs will use the functions but produce more readable output. We will run these programs via Python’s -m functionality. Thus, each of find.py and compare.py should be usable both as a module and as a script. Use __name__ to check which case is being used.

2a. food_data.find (15 pts)

In this module, we will create a command-line program that accepts two flags, -b for get_by_brand searches, and -d for get_by_description searches. Both will take a single string as the final argument. That string should be passed to the respective functions, but the output should be displayed in a more succinct manner. Each data item should be listed as <fdc_id> <brand_owner> <description>. Make sure to have a usage statement that is shown when a user enters incorrect input. You should test your script via the IPython magic command %run -m food_data.find .... Some sample output:

>>> %run -m food_data.find

Usage: python -m food_data.find -b <brand> | -d <description>

>>> %run -m food_data.find -d "Quaker*Oat"

1167761 Pepsico Inc. Quaker Oatmeal to Go Apple Cinnamon (6 - 2.1 Ounce) 12.6 Ounce 6 Pack Plastic Bag
1460360 Pepsico Inc. Quaker Oatmeal Squares Golden Maple 14.5 Ounce Paper Box
1460396 Pepsico Inc. QUAKER FRUIT & OATMEAL FRTFLLEDBR BERRY   10.4Z

Hints
  • Make sure to use the check for __name__ to see if we are running the module as a script.
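
A rough sketch of how the argument check, the usage statement, and the __name__ guard might fit together. The helper names parse_args, format_item, and main are assumptions, and the search itself is replaced by a placeholder print so the sketch runs without the dataset.

```python
import sys

USAGE = "Usage: python -m food_data.find -b <brand> | -d <description>"

def format_item(item):
    """Render one data item as '<fdc_id> <brand_owner> <description>'."""
    return f"{item['fdc_id']} {item['brand_owner']} {item['description']}"

def parse_args(argv):
    """Return (flag, value) for valid input, or None if the input is malformed."""
    if len(argv) == 2 and argv[0] in ("-b", "-d"):
        return argv[0], argv[1]
    return None

def main(argv):
    parsed = parse_args(argv)
    if parsed is None:
        print(USAGE)
        return 1
    flag, value = parsed
    # A real find.py would call get_by_brand or get_by_description here
    # and print format_item(item) for each result.
    print(f"would search with {flag} for {value!r}")
    return 0

if __name__ == "__main__":
    # True for `python -m ...` and `%run -m ...`, but not on a plain import.
    main(sys.argv[1:])
```

Keeping the parsing in a function that takes a list, rather than reading sys.argv directly, makes it easy to try from a notebook, e.g. main(['-b', 'Red Gold']).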

2b. food_data.compare (15 pts)

In this module, we will create a command-line program that accepts two flags, -n for diff_nutrition and -i for diff_ingredients. Both will take two fdc_ids as integers and pass them to their respective functions. The output for diff_nutrition should show the name of the metric, and then a +/- value for the amount (with two decimal places) followed by the unit. The output for diff_ingredients should show the common ingredients prefixed by a space, the ingredients in the first item prefixed by a - and the ingredients in the second item prefixed by a +. You can test your script via the IPython magic command %run -m food_data.compare ...

Some sample output:

>>> %run -m food_data.compare -n 622876 1120164

Serving Size: +2.00 g
Saturated Fat: -3.78 g
Calories: -193.00 kcal
Total Fat: -25.91 g
Sugar: +12.90 g
Carbohydrates: +9.45 g
Protein: -3.45 g
Sodium: +448.00 mg
Fiber: +0.00 g

>>> %run -m food_data.compare -i 622876 1120164

  WATER
  DISTILLED VINEGAR
  XANTHAN GUM
  SALT
  SOYBEAN OIL
  CALCIUM DISODIUM EDTA ADDED TO PROTECT FLAVOR
  LEMON JUICE CONCENTRATE
- DEHYDRATED GARLIC
- BALSAMIC VINEGAR (PRESERVED WITH SULFITES)
- ANNATTO COLOR.
- SPICES
- BEET JUICE CONCENTRATE
- SUGAR
- CARAMEL COLOR
- HIGH FRUCTOSE CORN SYRUP
+ CHEESE CULTURE
+ GARLIC*
+ RED WINE VINEGAR
+ CORN SYRUP
+ ONION*
+ ENZYMES)
+ NATURAL FLAVOR
+ TAMARIND.
+ CELERY SEED
+ PARMESAN CHEESE (PART-SKIM MILK
+ ANCHOVIES
+ SPICE
+ POTASSIUM SORBATE (ADDED TO MAINTAIN FRESHNESS)
+ MUSTARD SEED

Hints
  • Remember all elements in sys.argv are strings and may need to be converted.
  • Consider creating an auxiliary function to retrieve the data items referenced by the two fdc_ids.
  • Make sure to use the check for __name__ to see if we are running the module as a script.
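
The output formats described above can be produced with Python's format specification mini-language, where a + in the spec forces a sign on non-negative numbers. The sketch below assumes hypothetical helpers format_nutrition and format_ingredients; lowercasing the unit and using a two-character prefix for shared ingredients are inferred from the sample output rather than stated requirements.

```python
def format_nutrition(name, amount, unit):
    """Format one difference as '<name>: <signed amount> <unit>'."""
    # '+.2f' means: always show the sign, two digits after the decimal point.
    return f"{name}: {amount:+.2f} {unit.lower()}"

def format_ingredients(shared, only_first, only_second):
    """Prefix shared items with spaces, first-only with '-', second-only with '+'."""
    lines = [f"  {name}" for name in shared]
    lines += [f"- {name}" for name in only_first]
    lines += [f"+ {name}" for name in only_second]
    return lines

# Command-line arguments arrive as strings and need converting.
ids = [int(arg) for arg in ["622876", "1120164"]]
print(ids)

print(format_nutrition("Saturated Fat", -3.7799999999999994, "G"))
print(format_nutrition("Fiber", 0.0, "G"))
for line in format_ingredients(["WATER"], ["SUGAR"], ["SPICE"]):
    print(line)
```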

3. [CSCI 503 Only] Add Category Filtering (15 pts)

For this part, you will update the food_data.find module and add the ability to filter by category. Create a new function filter_by_category that will filter data items by branded_food_category. Then, update the command-line program so that a user can add an optional argument -c followed by a category. The -c parameter can be before or after the -d or -b flags. Update the usage function to reflect this addition. Sample Output:

>>> %run -m food_data.find -c "Salad Dressing & Mayonnaise" -b "T. Marzetti Company"

1120164 T. Marzetti Company CAESAR VINAIGRETTE DRESSING, CAESAR VINAIGRETTE
622876 T. Marzetti Company BALSAMIC VINAIGRETTE DRESSING, BALSAMIC VINAIGRETTE
1207924 T. Marzetti Company BALSAMIC CABERNET VINEYARD DRESSING
1218889 T. Marzetti Company BALSAMIC VINAIGRETTE, BALSAMIC
1219019 T. Marzetti Company RANCH DRESSING, RANCH
1511045 T. Marzetti Company SIGNATURE ITALIAN DRESSING, SIGNATURE ITALIAN
1667586 T. Marzetti Company THREE CHEESE CAESAR DRESSING, THREE CHEESE CAESAR
2022391 T. Marzetti Company BALSAMIC VINAIGRETTE DRESSING, BALSAMIC VINAIGRETTE
2269286 T. Marzetti Company THOUSAND ISLAND DRESSING, THOUSAND ISLAND

>>> %run -m food_data.find -b "T. Marzetti Company" -c "Dips & Salsa"

1110302 T. Marzetti Company HOT SRIRACHA RANCH VEGGIE DIP, HOT SRIRACHA RANCH
938428 T. Marzetti Company ROASTED RED PEPPER SIMPLE HARVEST DIP, ROASTED RED PEPPER
2036349 T. Marzetti Company SPINACH VEGGIE DIP, SPINACH

Hints
  • You’ll need to check particular indices of sys.argv to determine which flag is being used.
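
As an alternative to checking fixed positions, the optional -c pair can be located and removed first, leaving the remaining arguments in the same shape as in Part 2a. This is only a sketch: split_category is a hypothetical helper, and it does not handle the edge case where a brand or description value is itself the string "-c".

```python
def split_category(argv):
    """Return (category, remaining_args); category is None when -c is absent."""
    args = list(argv)  # copy so the caller's list is not modified
    if "-c" not in args:
        return None, args
    i = args.index("-c")
    if i + 1 >= len(args):
        raise ValueError("-c must be followed by a category")
    category = args[i + 1]
    del args[i:i + 2]  # drop the flag and its value
    return category, args

before = split_category(["-c", "Dips & Salsa", "-b", "T. Marzetti Company"])
after = split_category(["-b", "T. Marzetti Company", "-c", "Dips & Salsa"])
print(before)
print(after)
```

Both orderings produce the same result, so the rest of the program can process the -b or -d flag exactly as before and then apply the category filter when the category is not None.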

Extra Credit

  • [15 pts] CSCI 490 Students may complete Part 3 for extra credit
  • [10 pts] Add a flag to the diff_nutrition method that indicates whether to scale the nutrition details according to the serving sizes before computing the difference, and modify the code as necessary, maintaining the original behavior if the flag is off. Then add a command-line flag to the compare program to allow users to toggle this on or off. For example, if item A has a 200g serving size and item B a 100g serving size, and A has 4g sugar and B has 2g sugar, scaling B to the same 200g serving size would give it 4g sugar and make the diff 0g. Scale all nutrition info like this before computing differences.
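
The scaling arithmetic in the example above can be checked with a few lines. The function name scaled_difference and its parameter order are assumptions; the sketch scales item B to item A's serving size, as in the example, and does not guard against a zero serving size.

```python
def scaled_difference(amount_a, serving_a, amount_b, serving_b):
    """Scale item B's amount to item A's serving size, then subtract."""
    factor = serving_a / serving_b  # e.g. 200 g / 100 g = 2.0
    return amount_a - amount_b * factor

# Item A: 200 g serving with 4 g sugar; item B: 100 g serving with 2 g sugar.
print(scaled_difference(4.0, 200.0, 2.0, 100.0))
```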