Assignment 5

Goals

The goal of this assignment is to work with scripts and packages in Python.

Instructions

You will be doing your work in Python for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.9 or higher for your work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .py files to turn in. If you choose to work locally, Anaconda is the easiest way to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application or from the command line as jupyter-lab.

In this assignment, we will again be working with data from the United States Department of Agriculture’s FoodData Central that we first used for Assignment 3. That data is located here. Once loaded, the data is a list of dictionaries where each dictionary has nine key-value pairs. Those keys and a brief description are:

  • fdc_id: a unique identifier assigned by FoodData Central
  • brand_owner: the company that makes the product
  • brand_name: a brand name, if different from the company
  • description: the product’s name or description
  • branded_food_category: the category for the food product
  • ingredients: a comma-separated string of ingredients in the product
  • serving_size: the serving size of the product in the units specified by serving_size_unit
  • serving_size_unit: the units for the serving size value
  • nutrition: a list of dictionaries containing nutrition information; each dictionary contains the keys name, amount, and unit_name and their associated values

You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.

Due Date

The assignment is due at 11:59pm on Thursday, April 7.

Submission

You should submit the completed Python files required for this assignment on Blackboard. Zip the files together; the filename of the zipfile should be a5.zip. You can create an archive on tiger (assuming you created an a5 directory above the package that is your current working directory) using the following code in a notebook:

import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')

Then, download the a5.zip file to turn in via Blackboard. Make sure your archive contains all of the food_data package.

Details

Please make sure to follow instructions to receive full credit. To test your code, you may use the %run magic command in the notebook. For example,

%run -m food_data.find -b "Red Gold" 

You may also use the Terminal in Jupyter on tiger, but you should activate the correct environment using conda:

$ conda activate py39
$ python -m food_data.find -b "Red Gold"  # '-b' added 4/11/2022 

0. Name & Z-ID (5 pts)

Since we are using Python (.py) files for this assignment, add the identifying information to the __init__.py file of your package. Minimally, you should have a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Food Data Package

Create three new Python modules, one for reading the dataset, one for filtering items by brand name, and one for comparing two food items. Put the three modules (util.py, find.py, and compare.py) into a package named food_data.

1a. Data Utilities (15 pts)

Create a util.py module that has three functions: download_data, get_data, and parse_ingredients.

The download_data function should download the food-data-sample.json datafile and store it locally. The get_data function should load the data into a module variable. Assume that the data file resides in the same directory as util.py. You can then get its absolute path from the module's __file__ variable:

import os
fname = os.path.join(os.path.dirname(__file__),'food-data-sample.json')

Use the json module to load the data from the file. The download_data function should download the file just once; on later calls it should simply return the local filename. Refer to the Assignment 3 starter notebook for code that can be used to download the data file. The get_data function should load and parse the file from disk just once; on later calls it should return the pre-loaded data. The parse_ingredients function should parse the ingredients string into a list of ingredients by splitting on commas. Do not worry if this does not do a perfect job (e.g. you may see an ingredient labeled as “LESS THAN 2% OF: SALT”).

Hints
  • Initialize the module variable to a sentinel value to indicate whether the data has been read.
  • You can use %autoreload to automatically reload modules as you edit them. Do note, however, that this will mask the effects of trying not to keep reloading the data! You can also use importlib.reload to do this manually.
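The load-once behavior and the sentinel hint can be sketched as follows. This is only an illustration, not the required solution: the fname parameter and the temporary file are assumptions made so the example runs on its own, whereas in the assignment the path would come from __file__ as shown above. Stripping whitespace around each ingredient is also an assumption.

```python
import json
import os
import tempfile

_DATA = None  # sentinel: None means "not loaded yet"


def get_data(fname):
    """Load and parse fname on the first call; return the cached data afterwards.

    The fname parameter is hypothetical and exists only for this demo.
    """
    global _DATA
    if _DATA is None:
        with open(fname, encoding="utf-8") as f:
            _DATA = json.load(f)
    return _DATA


def parse_ingredients(ingredients):
    """Split a comma-separated ingredient string, trimming whitespace."""
    if not ingredients:  # covers None and the empty string
        return []
    return [part.strip() for part in ingredients.split(",") if part.strip()]


# Demo with a throwaway file standing in for food-data-sample.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"fdc_id": 1, "ingredients": "TOMATOES, SALT"}], f)
    first = get_data(path)
    second = get_data(path)  # served from the module variable; no file read

print(first is second)
print(parse_ingredients(first[0]["ingredients"]))
```

Because the second call returns the very same list object, reloading the module (which resets the sentinel) is the only way the file gets read again.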

1b. Finding Items (15 pts)

Create a find.py module with two functions, get_by_brand and get_by_description. Each takes one parameter (a brand owner and a description expression, respectively) and returns the list of data items that match. Use the get_data function from the util module to obtain the data. get_by_brand should return the items whose brand owner exactly matches the given string, ignoring case. get_by_description should return the items whose description matches the given expression. The expression may contain literal characters as well as the wildcard character (*), which matches any number of characters. Remember that when you use Python’s regular expression library to deal with wildcards, you need to change * to .*. Your regular expression should ignore case. For example, consider the output when searching for "Quaker*Oat":

>>> [d['description'] for d in food_data.find.get_by_description("Quaker*Oat")]
['QUAKER, OAT BRAN, HOT CEREAL',
 'QUAKER, OLD FASHIONED OATS',
 'QUAKER, INSTANT OATMEAL, BANANA NUT',
 'Quaker Oat Bran Hot Cereal 16 Ounce Box',
 'Quaker Corn Crunch Toasted Corn & Oat Cereal 15 Ounce Paper Box']

Hints
  • Make sure to import the util module! You might consider using relative imports to do this from a sibling module.
  • You may need to check if a particular attribute is None before using it.
  • Remember the difference between re.match and re.search for regular expressions.
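As a rough illustration of the wildcard translation and of the re.match versus re.search hint, here is a small sketch. The helper name wildcard_to_regex and the sample descriptions are made up for the example. Escaping the literal pieces with re.escape is an extra precaution beyond what the assignment asks for.

```python
import re


def wildcard_to_regex(expression):
    """Compile a pattern in which * matches any run of characters, ignoring case."""
    # Escape each literal piece so characters such as '.' or '(' stay literal,
    # then join the pieces with '.*' in place of each wildcard.
    pieces = [re.escape(piece) for piece in expression.split("*")]
    return re.compile(".*".join(pieces), re.IGNORECASE)


pattern = wildcard_to_regex("Quaker*Oat")

descriptions = [
    "QUAKER, OLD FASHIONED OATS",
    "Instant Quaker Oatmeal",  # made-up entry: the match does not start at index 0
    "CHEERIOS, OAT CEREAL",
    None,                      # some attributes may be missing
]

# re.search scans the whole string; re.match only tries at the start.
found_by_search = [d for d in descriptions if d is not None and pattern.search(d)]
found_by_match = [d for d in descriptions if d is not None and pattern.match(d)]

print(found_by_search)
print(found_by_match)
```

The second description is found by search but not by match, which is exactly the difference the hint refers to.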

1c. Comparison (15 pts)

Create a compare.py module that calculates comparative information between two food items. Given two food items’ fdc_ids as parameters, the diff_nutrition function should return the differences between the nutrition values of the two items, and the diff_ingredients function should return the differences between the ingredients in the items. diff_nutrition should also include the difference in serving size, and it should return a list of tuples of the form (<name>, <amount>, <unit_name>). Make sure the differences are between the same measures! diff_ingredients should parse each ingredient list via the parse_ingredients function in the util module from Part 1a, and then create three different sets: one for shared ingredients, one for ingredients in the first item and not the second, and one for ingredients in the second item and not the first.
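One way to keep the differences "between the same measures" is to pair nutrition entries by name rather than by list position. The sketch below assumes only the name, amount, and unit_name keys described earlier. The helper name diff_by_name, the sample values, and the direction of subtraction (first minus second) are all assumptions; check the direction against the sample output.

```python
def diff_by_name(first_nutrition, second_nutrition):
    """Pair entries by name and subtract amounts (first minus second, an assumption)."""
    second_by_name = {entry["name"]: entry for entry in second_nutrition}
    diffs = []
    for entry in first_nutrition:
        other = second_by_name.get(entry["name"])
        # Skip measures missing from the second item or given in another unit.
        if other is None or other["unit_name"] != entry["unit_name"]:
            continue
        diffs.append(
            (entry["name"], entry["amount"] - other["amount"], entry["unit_name"])
        )
    return diffs


# Made-up values; note that the two lists are in different orders.
first = [
    {"name": "Sodium", "amount": 300.0, "unit_name": "MG"},
    {"name": "Protein", "amount": 1.0, "unit_name": "G"},
    {"name": "Iron", "amount": 0.5, "unit_name": "MG"},
]
second = [
    {"name": "Protein", "amount": 1.0, "unit_name": "G"},
    {"name": "Sodium", "amount": 163.0, "unit_name": "MG"},
]

result = diff_by_name(first, second)
print(result)
```

Iron is dropped here because the second item has no such entry. Whether to skip or report such measures is a design decision the assignment leaves open.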

Examples:

>>> food_data.compare.diff_nutrition(345534, 604974)
[('Serving Size', 0.0, 'g'),
 ('Total Fat', -0.41, 'G'),
 ('Fiber', 0.0, 'G'),
 ('Calories', -12.0, 'KCAL'),
 ('Protein', 0.0, 'G'),
 ('Sugar', -2.48, 'G'),
 ('Saturated Fat', 0.0, 'G'),
 ('Sodium', 137.0, 'MG'),
 ('Carbohydrates', -0.8300000000000001, 'G')]

>>> food_data.compare.diff_ingredients(345534, 604974) 
({'Calcium Chloride', 'Citric Acid', 'Tomato Juice', 'Tomatoes'},
 {'Less Than 2% of: Salt'},
 {'Dried Garlic',
  'Dried Onion',
  'Natural Flavor',
  'Olive Oil',
  'Salt',
  'Soybean Oil',
  'Spices'})

Hints
  • Consider testing the functions via code in a notebook. You may also do this in the modules themselves, but remember to make sure they only run when the module is run as a script.
  • Remember that Python has set operators that help determine which items are shared or in one set and not the other
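The set operators mentioned in the last hint map directly onto the three groups of ingredients. A minimal sketch, using a hypothetical helper name and hand-written ingredient lists:

```python
def split_ingredients(first, second):
    """Return (shared, only in first, only in second) as three sets."""
    a, b = set(first), set(second)
    return a & b, a - b, b - a  # intersection, then the two differences


shared, only_first, only_second = split_ingredients(
    ["Tomatoes", "Citric Acid", "Less Than 2% of: Salt"],
    ["Tomatoes", "Citric Acid", "Salt", "Spices"],
)

print(shared)
print(only_first)
print(only_second)
```

Sets are unordered, so the printed order of elements may vary from run to run.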

1d. Package

Make sure all three analysis modules live in a single food_data package. Add an __init__.py file for completeness. It may contain documentation and the pass keyword.

2. Command-Line Programs

Now, we will create two command-line programs that live in the find.py and compare.py modules we just created. These command-line programs will use the functions but produce more readable output. We will run these programs via Python’s -m functionality. Thus, each of find.py and compare.py should be usable both as a module and as a script. Use __name__ to check which case is being used.
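The general shape of a file that works both as a module and as a script might look like the sketch below. The greet function, the main function, and the mypackage.greet name in the usage text are placeholders invented for illustration; they are not part of the assignment.

```python
import sys


def greet(name):
    """Importable function: usable from a notebook or from another module."""
    return f"Hello, {name}"


def main(argv):
    """Script entry point; argv[0] is the program name, as in sys.argv."""
    if len(argv) != 2:
        print("Usage: python -m mypackage.greet <name>")
        return 1
    print(greet(argv[1]))
    return 0


if __name__ == "__main__":
    # True only when the file is run as a script (python -m ... or %run -m ...).
    # When the file is imported, __name__ is the module name and nothing runs.
    main(sys.argv)
```

Keeping the script logic in a function that takes the argument list makes it easy to call from a notebook with a hand-built list while testing.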

2a. food_data.find (15 pts)

In this module, we will create a command-line program that accepts two flags, -b for get_by_brand searches, and -d for get_by_description searches. Both will take a single string as the final argument. That string should be passed to the respective functions, but the output should be displayed in a more succinct manner. Each data item should be listed as <fdc_id> <brand_owner> <description>. Make sure to have a usage statement that is shown when a user enters incorrect input. You should test your script via the IPython magic command %run -m food_data.find .... Some sample output:

>>> %run -m food_data.find

Usage: python -m food_data.find -b <brand> | -d <description>

>>> %run -m food_data.find -d "Quaker*Oat"

538773 The Quaker Oats Company QUAKER, OAT BRAN, HOT CEREAL
539625 The Quaker Oats Company QUAKER, OLD FASHIONED OATS
540325 The Quaker Oats Company QUAKER, INSTANT OATMEAL, BANANA NUT
767774 QTG Quaker Oat Bran Hot Cereal 16 Ounce Box
768340 QTG Quaker Corn Crunch Toasted Corn & Oat Cereal 15 Ounce Paper Box

Hints
  • Make sure to use the check for __name__ to see if we are running the module as a script.
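A possible way to validate the two-flag command line and produce the <fdc_id> <brand_owner> <description> lines is sketched here. The helper names parse_args and format_item are hypothetical, and the sample item is copied from the output above.

```python
USAGE = "Usage: python -m food_data.find -b <brand> | -d <description>"


def parse_args(args):
    """Return (flag, value) for valid input, or None otherwise.

    args is the argument list without the program name, i.e. sys.argv[1:].
    """
    if len(args) == 2 and args[0] in ("-b", "-d"):
        return args[0], args[1]
    return None


def format_item(item):
    """Render one data item as '<fdc_id> <brand_owner> <description>'."""
    return f"{item['fdc_id']} {item['brand_owner']} {item['description']}"


item = {
    "fdc_id": 539625,
    "brand_owner": "The Quaker Oats Company",
    "description": "QUAKER, OLD FASHIONED OATS",
}

for args in (["-d", "Quaker*Oat"], [], ["-x", "oops"]):
    parsed = parse_args(args)
    print(parsed if parsed is not None else USAGE)

print(format_item(item))
```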

2b. food_data.compare (15 pts)

In this module, we will create a command-line program that accepts two flags, -n for diff_nutrition and -i for diff_ingredients. Both will take two fdc_ids as integers and pass them to their respective functions. The output for diff_nutrition should show the name of the metric, and then a +/- value for the amount (with two decimal places) followed by the unit. The output for diff_ingredients should show the common ingredients prefixed by a space, the ingredients found only in the first item prefixed by a -, and the ingredients found only in the second item prefixed by a +. You can test your script via the IPython magic command %run -m food_data.compare ...

Some sample output:

>>> %run -m food_data.compare -n 345534 604974

Serving Size: +0.00 g
Total Fat: -0.41 g
Fiber: +0.00 g
Calories: -12.00 kcal
Protein: +0.00 g
Sugar: -2.48 g
Saturated Fat: +0.00 g
Sodium: +137.00 mg
Carbohydrates: -0.83 g

>>> %run -m food_data.compare -i 345534 604974

  Tomatoes
  Tomato Juice
  Citric Acid
  Calcium Chloride
- Less Than 2% of: Salt
+ Natural Flavor
+ Dried Onion
+ Dried Garlic
+ Salt
+ Olive Oil
+ Spices
+ Soybean Oil

Hints
  • Remember all elements in sys.argv are strings and may need to be converted.
  • Consider creating an auxiliary function to retrieve the data items referenced by the two fdc_ids.
  • Make sure to use the check for __name__ to see if we are running the module as a script.
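The output formats described for this part can be produced with format specifiers: the + flag in :+.2f forces a sign and two decimal places. The sketch below uses hypothetical helper names. Lowercasing the unit is inferred from the sample output, and sorting the ingredients is an assumption made only so the demo output is stable.

```python
def format_nutrition(diffs):
    """Turn (name, amount, unit) tuples into 'Name: +0.00 unit' lines."""
    return [f"{name}: {amount:+.2f} {unit.lower()}" for name, amount, unit in diffs]


def format_ingredients(shared, only_first, only_second):
    """Prefix shared items with a space, first-only with '-', second-only with '+'."""
    lines = [f"  {name}" for name in sorted(shared)]
    lines += [f"- {name}" for name in sorted(only_first)]
    lines += [f"+ {name}" for name in sorted(only_second)]
    return lines


def parse_ids(args):
    """Convert command-line strings to integers; return None if any is not a number."""
    try:
        return [int(arg) for arg in args]
    except ValueError:
        return None


nutrition_lines = format_nutrition(
    [("Sodium", 137.0, "MG"), ("Carbohydrates", -0.8300000000000001, "G")]
)
ingredient_lines = format_ingredients(
    {"Tomatoes"}, {"Less Than 2% of: Salt"}, {"Spices", "Salt"}
)

print("\n".join(nutrition_lines))
print("\n".join(ingredient_lines))
print(parse_ids(["345534", "604974"]))
```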

3. [CSCI 503 Only] Add Category Filtering (15 pts)

For this part, you will update the food_data.find module and add the ability to filter by category. Create a new function filter_by_category that will filter data items by branded_food_category. Then, update the command-line program so that a user can add an optional argument -c followed by a category. The -c argument and its category may appear before or after the -d or -b flag and its value. Update the usage function to reflect this addition. Sample output:

>>> %run -m food_data.find -c "Cookies & Biscuits" -b "The Quaker Oats Company"

525081 The Quaker Oats Company PEANUT BUTTER OATMEAL COOKIES
713124 The Quaker Oats Company PEANUT BUTTER SANDWICH MINIS FILLED BISCUIT BITES, PEANUT BUTTER

>>> %run -m food_data.find -b "The Quaker Oats Company" -c "Cookies & Biscuits"

525081 The Quaker Oats Company PEANUT BUTTER OATMEAL COOKIES
713124 The Quaker Oats Company PEANUT BUTTER SANDWICH MINIS FILLED BISCUIT BITES, PEANUT BUTTER

Hints
  • You’ll need to check particular indices of sys.argv to determine which flag is being used.
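One way to accept the flags in either order is to walk the arguments in flag/value pairs instead of hard-coding positions. This is a sketch only. The helper name parse_flags is hypothetical, and the rule that exactly one of -b or -d must be present is an assumption based on the usage string shown earlier.

```python
def parse_flags(args, allowed=("-b", "-d", "-c")):
    """Return {flag: value} for valid input, or None otherwise.

    args is the argument list without the program name, i.e. sys.argv[1:].
    """
    if len(args) % 2 != 0:
        return None  # every flag needs a value
    options = {}
    for i in range(0, len(args), 2):
        flag, value = args[i], args[i + 1]
        if flag not in allowed or flag in options:
            return None  # unknown or repeated flag
        options[flag] = value
    # Assumption: exactly one of -b / -d is required, while -c is optional.
    if ("-b" in options) == ("-d" in options):
        return None
    return options


before = parse_flags(["-c", "Cookies & Biscuits", "-b", "The Quaker Oats Company"])
after = parse_flags(["-b", "The Quaker Oats Company", "-c", "Cookies & Biscuits"])

print(before == after)
print(parse_flags(["-c", "Cookies & Biscuits"]))
```

Both orderings produce the same dictionary, so the rest of the program does not need to care where -c appeared.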

Extra Credit

  • [15 pts] CSCI 490 students may complete Part 3 for extra credit.
  • [10 pts] Add a flag to the diff_nutrition function that indicates whether to scale the nutrition details according to the serving sizes before computing the difference, and modify the code as necessary, maintaining the original behavior if the flag is off. Then add a command-line flag to the compare program to allow users to toggle this on or off. For example, if item A has a 200g serving size and item B a 100g serving size, and A has 4g sugar and B has 2g sugar, scaling B to the same 200g serving size would give it 4g sugar and make the diff 0g. Scale all nutrition info like this before computing differences.
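The arithmetic in the scaling example can be checked with a few lines. The helper name scale_amount is hypothetical, and scaling the second item to the first item's serving size follows the example in the text.

```python
def scale_amount(amount, serving_size, target_serving_size):
    """Rescale a nutrient amount from one serving size to another."""
    return amount * target_serving_size / serving_size


# Item A: 200 g serving with 4 g sugar; item B: 100 g serving with 2 g sugar.
a_sugar, a_serving = 4.0, 200.0
b_sugar, b_serving = 2.0, 100.0

b_scaled = scale_amount(b_sugar, b_serving, a_serving)  # 2 * 200 / 100
diff = a_sugar - b_scaled

print(b_scaled, diff)
```

The same function would apply to every entry in the nutrition list, provided both serving sizes are given in the same unit.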