I just started business school at Duke University two months ago, and it’s been amazing! I feel like I’ve already made lifelong friends, and there are lots of great events to kick things off with the business school as well as fun things to do in the Raleigh-Durham area.
Our program, the Master of Quantitative Management (MQM), recently hosted its Summer Data Competition. The basic idea was to produce an interesting data set (ostensibly not including any insights taken from it) using any means available. We’d be judged on criteria like originality, cleverness, and usability/potential for insights – of course, demonstrating that potential means performing at least some analysis yourself…
An entry I made with my friend Mayank ended up making it into the finals. I thought the idea was really cool. Here’s what we did:
Premise
Like many students, I’ve been trying to maintain a good diet on a low budget, and I’ve come to notice a basic, inescapable dilemma for all eaters. Basically, you can eat cheaply, eat healthily, or eat out. Pick two. Students/early career folks like me generally end up sacrificing the convenience and time savings of having someone else make our meals in favor of cost savings.
If we’re lucky, we also get to maintain our overall health. It’s obviously not guaranteed you even get two of them. The broke college student chowing down on instant ramen every night is a cliché for a reason.
There are a plethora of reasons as to why it could be difficult to cook healthy meals for yourself all the time, especially when you’re low on ingredients, money, or you have to follow some specific diet or nutritional guidelines. But sometimes, it’s just because it isn’t obvious which recipes are healthy or not just by looking at the ingredient list. You might notice that the vast majority of recipes don’t include nutrition facts, and the ones that do have narrow selections and mostly include health-first, taste-second recipes. That’s no good.
A lack of easily accessible basic nutritional information for common recipes should never be a reason to sacrifice your health. We thought that, with some simple data transformations, it would be possible to scrape nutritional information for recipes online.
Introduction
Our dataset focuses on the nutritional profiles of publicly available food and drink recipes on various popular culinary websites; we chose to focus on US-based recipe catalogues to avoid language confusion and to ensure a stronger cultural grasp of the recipes we analyzed.
The record is arranged into 2976 rows and 19 columns, with each row corresponding to a given recipe entry. Five columns are reserved for recipe metadata (e.g. title, average rating, url, etc), and the remaining are nutrition based. We used the USDA Food Composition Databases API to access nutritional information for each ingredient, then applied natural language processing techniques to parse the units of measurement according to each recipe – think pounds, cups, teaspoons, or grams – and converted each to a mass standard that the API could retrieve.
Data Acquisition
While the data acquisition process was relatively straightforward in principle, our team had to overcome significant technical hurdles to obtain the recipe data and convert it to useful nutritional information.
First, we needed to design a web crawler that found directories in target websites that matched a particular signature pointing only to recipe pages. After tinkering for a while, we found that most of the sites we tested had a “recipe” tag in their url path that made this distinction obvious. We used dirhunt, a command-line open source site analyzer that attempts to compile all directories while minimizing requests to the server (Nekmo).
Next, we needed to scrape the data from each recipe URL. We ended up using recipe-scrapers, an open-source Python package for gathering basic data from popular recipe site formats (hhursev). This package gave us easy access to the recipe titles, average ratings, and ingredient lists, among other important data.
Critically, the ingredients were formatted as a Python list of strings in their raw delineation. For instance, one item could look like “1 1/2 cups unbleached white flour”. We needed to first convert the “1 1/2” into a proper floating point, as well as change all measurements into the standard grams that the USDA nutritional database requires. Python offers a “fraction” package for converting strings of fractions into floating point numbers, as well as a “word2number” package for converting strings to numbers (e.g. “three” to 3).
We wrote a lookup table for converting all masses into grams, as well as all volumes into grams based on the ingredient type. For volume-based ingredients not found in our lookup table, the script defaulted to using the conversion factor for water (approx. 240 grams per cup), which proved to be a close estimate for a wide range of food types – most food is mostly water!
Finally, we used the USDA Food Composition Databases API to search for these ingredients and obtain nutritional data. The API allows for searching with natural language, though some foods were still impossible to find through the API; we decided to discard any recipes that had untraceable ingredients given the time restrictions of the competition.
The request limit on this API also meant that we were realistically limited to a few hundred recipes per site for our final dataset; we decided to spread relatively evenly over the sites to include a wide range of recipe influences.
Dataset Description
Recipes-Meta is a database of recipes scraped from popular websites with computed detailed nutrition data taken from USDA for each ingredient to potentially help consumers make more informed eating choices and offer insights in the relationships between ingredients, nutrients, and site visitor opinions. Each row is a recipe entry that can be uniquely referenced by its URL.
Columns:
Title: Name of recipe
Rating: Average rating of recipe as given by users (some sites do not have this feature)
URL: Web address of the recipe (unique/primary key)
Servings: Number of people the recipe serves, i.e. serving size (nutrition data is divided by this)
Ingredients: List of ingredients in the recipe
Energy (kcal): Total calories of the recipe per serving in kcal
Carbohydrate, by difference (g): Total carbohydrates of the recipe per serving in g
Protein (g): Total protein of the recipe per serving in g
Calcium, Ca (mg): Total calcium of the recipe per serving in mg
Cholesterol (mg): Total cholesterol in the recipe per serving in mg
Fatty acids, total saturated (g): Total saturated fat of the recipe per serving in g
Fatty acids, total trans (g): Total trans fats of the recipe per serving in g
Fiber, total dietary (g): Total dietary fibre of the recipe per serving in g
Iron, Fe (mg): Total iron content of ingredients used in the recipe per serving in mg
Sodium, Na (mg): Total sodium of ingredients used in the recipe per serving in mg
Sugars, total (g): Total sugar content of recipe per serving in g
Total lipid (fat) (g): Total lipids/fats of the recipe per serving in g
Vitamin A, IU (IU): Total Vitamins A of the recipe per serving in IU
Vitamin C, total ascorbic acid (mg): Total Vitamin C of the recipe per serving in mg
- Red indicates nutrition related data
- Blue indicates recipe related recipe related data
Potential Insights
There exists a critical gap between growing consumer demand for health-conscious eating options and readily available nutrition data for recipes online. Most consumers looking to eat balanced, tasty, and affordable meals while meeting their health goals must eventually learn to cook their own meals. However, convenient data to make informed choices for recipe-based meal planning does not exist for most popular recipe sources online.
We also noticed that the few websites that do show nutrition data for their recipes are geared towards consumers that already follow a diet plan or practice healthy eating as a part of their lifestyle. Further, these websites are often limited in scope, including only a small set of specific recipe choices or community-generated recipes from a small user base.
Considering that access to healthy eating options and food education in America is growing increasingly unequal, our approach to spreading awareness about nutrition aims to target the ‘average eater’ or general public (Hiza et al.). This requires us to access nutrition data for a wide range of popular websites, rather than the few that readily offer this information. While our algorithm is not perfect, it can serve as a starting point and proof-of-concept for similar endeavours in the future.
We suggest the following potential insights, though there are many more viable routes for analysis:
- Determine if “healthiness”/nutrition stats somehow relate to the average rating of recipes.
- Generate a custom list of recipes that fit a specific range of macronutrients (protein/carbs/fats).
- Define overall nutrition metrics in all recipes, for example, to find meals that have an especially high protein to calorie ratio.
- Check if recipes that include certain ingredients tend to be more or less healthy.
- Analyze which websites tend to post healthier and/or more well balanced recipes.
- Produce a nutritional ranking of all recipes according to their adherence to USDA guidelines (or some other metric).
- Flag recipes with especially high sugar, fat content, trans fat or cholesterol content and make this flag obvious if and when it is retrieved.
- Write an algorithm that generates customized eating plans that meet daily nutritional requirements.
For a more business oriented goal, this data could be integrated with personalised consumer-level data to design customized eating plans that follow individual nutritional requirements based on height, age, weight, BMI, or other factors. There are surely many interesting possibilities we did not discuss in this report. Happy hacking.
Sources
Hiza, Hazel A.B. et al. “Diet Quality of Americans Differs by Age, Sex, Race/Ethnicity, Income, and Education Level”. Journal of the Academy of Nutrition and Dietetics, Volume 113, Issue 2, 297 – 306.
hhursev. “recipe-scrapers – Python package for scraping recipes data”. Github Repository, 2019, https://github.com/hhursev/recipe-scrapers.
Nekmo. “Dirhunt – Find web directories without bruteforce”. Github Repository, 2019, https://github.com/Nekmo/dirhunt.
Recipe websites referenced: