The goal of this assignment is to work with scripts and packages in Python.
You will be doing your work in Python for this assignment. You may
choose to work on this assignment on a hosted environment (e.g. tiger) or on your own local
installation of Jupyter and Python. You should use Python 3.9 or higher
for your work. To use tiger, use the credentials you received. If you
work remotely, make sure to download the .py files to turn in. If you
choose to work locally, Anaconda is the easiest way
to install and manage Python. If you work locally, you may launch
Jupyter Lab either from the Navigator application or from the
command line as jupyter-lab.
In this assignment, we will again be working with data from the United States Department of Agriculture’s FoodData Central that we first used for Assignment 3. That data is located here. Once loaded, the data is a list of dictionaries where each dictionary has nine key-value pairs. Those keys and a brief description are:
fdc_id: a unique identifier assigned by FoodData Central
brand_owner: the company that makes the product
brand_name: a brand name, if different from the company
description: the product’s name or description
branded_food_category: the category for the food product
ingredients: a comma-separated string of ingredients in the product
serving_size: the serving size of the product in the units specified by serving_size_unit
serving_size_unit: the units for the serving size value
nutrition: a list of dictionaries containing nutrition information; each dictionary contains the keys name, amount, and unit_name and their associated values
You will be writing three Python modules, putting them in a package, and adding functionality to two of the modules to support command-line analysis. While not required, you will find it useful to create a notebook where you can test the modules and programs. You may use other modules from the Python stdlib (e.g. sys, collections) in this assignment.
The assignment is due at 11:59pm on Thursday, April 7.
You should submit the completed Python files required for this
assignment on Blackboard. Zip the files together; the filename of the
zipfile should be a5.zip. You can create an archive on tiger (assuming you
created an a5 directory above the package that is your current working
directory) using the following code in a notebook:
import shutil
shutil.make_archive('../a5', 'zip', '..', 'a5')
Then, download the a5.zip file to turn in via Blackboard. Make sure
your archive contains all of the food_data package.
Please make sure to follow instructions to receive full credit. To
test your code, you may use the %run magic command in the
notebook. For example,
%run -m food_data.find -b "Red Gold"
You may also use the Terminal in Jupyter on tiger, but you should activate the correct environment using conda:
$ conda activate py39
$ python -m food_data.find -b "Red Gold" # '-b' added 4/11/2022
Since we are using Python (.py) files for this assignment, add
the identifying information to the __init__.py file of your
package. Minimally, you should have a line for your name and a line for
your Z-ID. If you wish to add other information (the assignment name, a
description of the assignment), you may do so after these two lines.
Create three new Python modules, one for reading the dataset, one for
filtering items by brand name, and one for comparing two food items. Put
the three modules (util.py, find.py, and compare.py) into a package named
food_data.
Create a util.py module that has three methods: download_data, get_data,
and parse_ingredients.
The download_data method should download the food-data-sample.json datafile and
store it locally. The get_data method should load the data
into a module variable. Assume that the data file resides in the same
directory as util.py. You can then build its path
from the __file__ variable of the module:
import os
fname = os.path.join(os.path.dirname(__file__), 'food-data-sample.json')
Use the json module to load the data from the file. The
download_data method should download the file just
once; on later calls it should simply return the local filename. Refer to
the Assignment 3 starter notebook for code
that can be used to download the data file.
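One way to get the download-once behavior is to check whether the file already exists before fetching it. The sketch below is only an illustration: the url and fetch parameters are assumptions added here so the logic can be tried without a network connection, and your own download_data may take no arguments at all.

```python
import os
import urllib.request


def download_data(url, fname, fetch=urllib.request.urlretrieve):
    """Fetch url into fname only if fname is missing; always return fname.

    The url and fetch parameters are hypothetical; they are not part of
    the assignment's required interface.
    """
    if not os.path.exists(fname):
        # First call: the file is not on disk yet, so download it.
        fetch(url, fname)
    # Later calls skip the download and just report where the file is.
    return fname
```

Calling the function a second time with the same fname returns immediately, because the existence check fails to trigger a new download.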
The get_data method should load and parse the file from
disk once, otherwise returning the pre-loaded data. The
parse_ingredients method should parse the ingredients
string into a list of ingredients by splitting on a comma. Do not worry
if this does not do a perfect job (e.g. you may see an ingredient
labeled as “LESS THAN 2% OF: SALT”).
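The load-once idea can be sketched with a module-level variable that starts out empty and is filled on the first call. This is a rough illustration, not the required design: the _data name and the fname parameter are assumptions made here so the example is self-contained.

```python
import json

_data = None  # module-level cache (hypothetical name); None means "not loaded yet"


def get_data(fname):
    """Parse fname on the first call; return the cached list afterwards."""
    global _data
    if _data is None:
        with open(fname) as f:
            _data = json.load(f)
    return _data


def parse_ingredients(ingredients):
    """Split a comma-separated ingredients string and trim whitespace."""
    return [part.strip() for part in ingredients.split(",") if part.strip()]
```

As the text notes, a plain comma split leaves entries such as "LESS THAN 2% OF: SALT" intact, which is acceptable here.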
Hint: You can use %autoreload
to automatically reload modules as you edit them. Do note, however, that
this will mask the effects of trying not to keep reloading the data! You
can also use importlib.reload to do this manually.
Create a find.py
module that has two functions that
take one parameter each, brand owner and description expression,
respectively, and return the list of data items that match. Use the
get_data method from the util module to obtain the data.
The first method, get_by_brand, should return a list of the
items filtered by an exact match to brand owner, ignoring case.
The second method, get_by_description, should take an expression string
and return all items whose descriptions match it. The
expressions may use literal characters but also the wildcard
character (*) to match any number of
characters. Remember that when you use Python’s regular expression library to deal with
wildcards, you need to change * to .*. Your regular expression should
ignore case. For example, consider the output when
searching for "Quaker*Oat":
>>> [d['description'] for d in food_data.find.get_by_description("Quaker*Oat")]
['QUAKER, OAT BRAN, HOT CEREAL',
'QUAKER, OLD FASHIONED OATS',
'QUAKER, INSTANT OATMEAL, BANANA NUT',
'Quaker Oat Bran Hot Cereal 16 Ounce Box',
'Quaker Corn Crunch Toasted Corn & Oat Cereal 15 Ounce Paper Box']
Hint: Be aware of the difference between re.match and re.search for regular
expressions.
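The wildcard translation can be sketched as follows. The helper name wildcard_to_regex is hypothetical, and escaping the literal pieces with re.escape is an extra precaution assumed here, not something the assignment requires. The last sample description is made up to show how re.match and re.search differ; which of the two your solution should use is for you to decide from the expected output.

```python
import re


def wildcard_to_regex(expr):
    """Turn a wildcard expression such as 'Quaker*Oat' into a compiled regex."""
    # Escape each literal piece so characters like '.' or '(' match themselves,
    # then join the pieces with '.*', which matches any number of characters.
    pattern = ".*".join(re.escape(piece) for piece in expr.split("*"))
    return re.compile(pattern, re.IGNORECASE)


rx = wildcard_to_regex("Quaker*Oat")

descriptions = [
    "QUAKER, OLD FASHIONED OATS",
    "Quaker Oat Bran Hot Cereal 16 Ounce Box",
    "INSTANT OATMEAL BY QUAKER OATS",  # made-up description, for illustration only
]

# re.search looks for the pattern anywhere in the string.
found_anywhere = [d for d in descriptions if rx.search(d)]
# re.match only succeeds if the pattern matches at the start of the string.
found_at_start = [d for d in descriptions if rx.match(d)]
```

Here found_anywhere keeps all three descriptions, while found_at_start drops the made-up one because it does not begin with "Quaker".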
Create a compare.py module that calculates comparative
information between two food items. Given two food items’
fdc_ids as parameters, the diff_nutrition
function should return the difference between the nutrition
values of the two items, and the diff_ingredients
method should return the difference between the ingredients in the
items. Also, diff_nutrition should additionally include the
difference in serving size; it should return a list of tuples of the
form (<name>, <amount>, <unit_name>).
Make sure the differences are between the same measures! You should
parse the ingredient list via the parse_ingredients method in the
util module from Part 1a, and then create three different
sets: one for shared ingredients, one for ingredients in the first item
and not the second, and one for ingredients in the second item and not
the first.
Examples:
>>> food_data.compare.diff_nutrition(345534, 604974)
[('Serving Size', 0.0, 'g'),
('Total Fat', -0.41, 'G'),
('Fiber', 0.0, 'G'),
('Calories', -12.0, 'KCAL'),
('Protein', 0.0, 'G'),
('Sugar', -2.48, 'G'),
('Saturated Fat', 0.0, 'G'),
('Sodium', 137.0, 'MG'),
('Carbohydrates', -0.8300000000000001, 'G')]
>>> food_data.compare.diff_ingredients(345534, 604974)
({'Calcium Chloride', 'Citric Acid', 'Tomato Juice', 'Tomatoes'},
{'Less Than 2% of: Salt'},
{'Dried Garlic',
'Dried Onion',
'Natural Flavor',
'Olive Oil',
'Salt',
'Soybean Oil',
'Spices'})
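The three ingredient sets map directly onto Python's set operators. The sketch below works on two already-parsed ingredient lists; the function name diff_ingredient_lists is hypothetical, and looking up the items by fdc_id is left out.

```python
def diff_ingredient_lists(first, second):
    """Return (shared, only_in_first, only_in_second) as three sets."""
    a, b = set(first), set(second)
    # & is intersection; - is set difference.
    return a & b, a - b, b - a


shared, only_first, only_second = diff_ingredient_lists(
    ["Tomatoes", "Tomato Juice", "Less Than 2% of: Salt"],
    ["Tomatoes", "Tomato Juice", "Salt", "Olive Oil"],
)
```

With these shortened lists, shared holds the two tomato entries, only_first holds the "Less Than 2% of: Salt" entry, and only_second holds "Salt" and "Olive Oil".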
Make sure all three analysis modules live in a single
food_data package. Add an __init__.py file for
completeness. It may contain documentation and the pass keyword.
Now, we will create two command-line programs that live in the
find.py and compare.py modules we just
created. These command-line programs will use the functions but produce
more readable output. We will run these programs via Python’s
-m functionality. Thus, each of find.py and compare.py
should be usable as a module and as a script. Use __name__
to check which case is being used.
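The usual pattern for a file that works both as a module and as a script is sketched below. The main function and its return value are assumptions made for illustration, and the body only prints a placeholder instead of calling the real search functions.

```python
import sys

USAGE = "Usage: python -m food_data.find -b <brand> | -d <description>"


def main(args):
    """Handle the arguments after the program name; return True if they were valid."""
    if len(args) != 2 or args[0] not in ("-b", "-d"):
        print(USAGE)
        return False
    flag, value = args
    # Placeholder: a real program would call get_by_brand or get_by_description.
    print(f"would run the {flag} search for {value!r}")
    return True


if __name__ == "__main__":
    # __name__ is "__main__" when the file is run with python -m or %run -m,
    # and is the module's own name when the file is imported.
    main(sys.argv[1:])
```

Keeping the logic in a function such as main means an import never triggers the command-line behavior, and the function can still be called directly from a notebook for testing.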
food_data.find (15 pts)
In this module, we will create a command-line program that accepts
two flags, -b for get_by_brand searches, and
-d for get_by_description searches. Both will
take a single string as the final argument. That string should be passed
to the respective function, but the output should be displayed in a
more succinct manner. Each data item should be listed as
<fdc_id> <brand_owner> <description>.
Make sure to have a usage statement that is shown when a user enters
incorrect input. You should test your script via the IPython magic
command %run -m food_data.find ... Some sample output:
>>> %run -m food_data.find
Usage: python -m food_data.find -b <brand> | -d <description>
>>> %run -m food_data.find -d "Quaker*Oat"
538773 The Quaker Oats Company QUAKER, OAT BRAN, HOT CEREAL
539625 The Quaker Oats Company QUAKER, OLD FASHIONED OATS
540325 The Quaker Oats Company QUAKER, INSTANT OATMEAL, BANANA NUT
767774 QTG Quaker Oat Bran Hot Cereal 16 Ounce Box
768340 QTG Quaker Corn Crunch Toasted Corn & Oat Cereal 15 Ounce Paper Box
Hint: Use __name__ to see if we
are running the module as a script.
food_data.compare (15 pts)
In this module, we will create a command-line program that accepts
two flags, -n for diff_nutrition and
-i for diff_ingredients. Both will take two
fdc_ids as integers and pass them to their respective functions. The
output for diff_nutrition should show the name of the
metric, and then a +/- value for the amount (with two decimal places)
followed by the unit. The output for diff_ingredients
should show the common ingredients prefixed by a space, the ingredients
only in the first item prefixed by a - and the ingredients only in
the second item prefixed by a +. You can test your script
via the IPython magic command
%run -m food_data.compare ...
Some sample output:
>>> %run -m food_data.compare -n 345534 604974
Serving Size: +0.00 g
Total Fat: -0.41 g
Fiber: +0.00 g
Calories: -12.00 kcal
Protein: +0.00 g
Sugar: -2.48 g
Saturated Fat: +0.00 g
Sodium: +137.00 mg
Carbohydrates: -0.83 g
>>> %run -m food_data.compare -i 345534 604974
Tomatoes
Tomato Juice
Citric Acid
Calcium Chloride
- Less Than 2% of: Salt
+ Natural Flavor
+ Dried Onion
+ Dried Garlic
+ Salt
+ Olive Oil
+ Spices
+ Soybean Oil
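The signed, two-decimal amounts can be produced with the + flag of Python's format specification. The sketch below is an assumption-laden illustration: the function names are hypothetical, units are lowercased to mirror the sample output, and the ingredient lines are sorted only to make the result predictable (the sample output above is not sorted).

```python
def format_nutrition(diffs):
    """Format (name, amount, unit_name) tuples as 'Name: +0.00 unit' lines."""
    # '+.2f' always shows the sign and keeps two decimal places.
    return [f"{name}: {amount:+.2f} {unit.lower()}" for name, amount, unit in diffs]


def format_ingredients(shared, first_only, second_only):
    """Prefix shared items with a space, first-only with '-', second-only with '+'."""
    lines = [f"  {name}" for name in sorted(shared)]  # space stands in for -/+
    lines += [f"- {name}" for name in sorted(first_only)]
    lines += [f"+ {name}" for name in sorted(second_only)]
    return lines


nutrition_lines = format_nutrition([
    ("Serving Size", 0.0, "g"),
    ("Sodium", 137.0, "MG"),
    ("Carbohydrates", -0.8300000000000001, "G"),
])
```

Printing each entry of nutrition_lines gives "Serving Size: +0.00 g", "Sodium: +137.00 mg", and "Carbohydrates: -0.83 g".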
Hints: The arguments in sys.argv are
strings and may need to be converted. Use __name__
to see if we are running the module as a script.
For this part, you will update the food_data.find module
and add the ability to filter by category. Create a new
function filter_by_category that will filter data items by
branded_food_category. Then, update the command-line
program so that a user can add an optional argument
-c followed by a category. The -c parameter
can be before or after the -d or -b flags.
Update the usage function to reflect this addition. Sample Output:
>>> %run -m food_data.find -c "Cookies & Biscuits" -b "The Quaker Oats Company"
525081 The Quaker Oats Company PEANUT BUTTER OATMEAL COOKIES
713124 The Quaker Oats Company PEANUT BUTTER SANDWICH MINIS FILLED BISCUIT BITES, PEANUT BUTTER
>>> %run -m food_data.find -b "The Quaker Oats Company" -c "Cookies & Biscuits"
525081 The Quaker Oats Company PEANUT BUTTER OATMEAL COOKIES
713124 The Quaker Oats Company PEANUT BUTTER SANDWICH MINIS FILLED BISCUIT BITES, PEANUT BUTTER
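Because -c may come before or after the search flag, one approach is to pair each flag with the value that follows it and then pull the category out of those pairs. The sketch below is one possible way to do this under that assumption; parse_find_args is a hypothetical helper, and it expects the list of arguments after the program name (for example sys.argv[1:]).

```python
def parse_find_args(args):
    """Return (flag, value, category), or None if the arguments are invalid."""
    if len(args) not in (2, 4):
        return None
    # Pair every flag (even positions) with the value right after it.
    options = dict(zip(args[0::2], args[1::2]))
    category = options.pop("-c", None)
    if len(args) == 4 and category is None:
        return None  # four arguments are only valid when one pair is -c
    if len(options) != 1:
        return None  # exactly one search flag must remain
    flag, value = next(iter(options.items()))
    if flag not in ("-b", "-d"):
        return None
    return flag, value, category


before = parse_find_args(["-c", "Cookies & Biscuits", "-b", "The Quaker Oats Company"])
after = parse_find_args(["-b", "The Quaker Oats Company", "-c", "Cookies & Biscuits"])
```

Both orderings produce the same tuple, and a None result is the cue to print the usage statement.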
Hint: Use sys.argv to
determine which flag is being used.
Add a flag parameter to the diff_nutrition method that
indicates whether to scale the nutrition details according to the
serving sizes before computing the difference, and modify the code as
necessary, maintaining the original behavior if the flag is off. Then
add a command-line flag to the compare program to allow users to toggle
this on or off. For example, if item A has a 200g serving size and item
B a 100g serving size, and A has 4g sugar and B has 2g sugar, scaling B
to the same 200g serving size would give it 4g sugar and make the diff
0g. Scale all nutrition info like this before computing
differences.
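The scaling arithmetic for a single nutrient can be sketched as below. The helper scaled_diff and its parameter names are hypothetical, and the direction of the subtraction (second item minus first) is an assumption, since the text above does not state it.

```python
def scaled_diff(amount_a, serving_a, amount_b, serving_b, scale=False):
    """Difference in one nutrient between items A and B (assumed: B minus A).

    With scale=True, B's amount is first rescaled to A's serving size.
    """
    if scale:
        # Amount per unit of serving for B, multiplied by A's serving size.
        amount_b = amount_b * serving_a / serving_b
    return amount_b - amount_a


# The example from the text: A is 200g with 4g sugar, B is 100g with 2g sugar.
unscaled = scaled_diff(4, 200, 2, 100)             # 2 - 4
scaled = scaled_diff(4, 200, 2, 100, scale=True)   # 2 * 200 / 100 - 4
```

With scale left off, the result is -2, which is the original behavior; with scale on, B's sugar becomes 4g and the difference is 0.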