Calorie Hunters :D
This is a project for the Winter 2023 offering of DSC 80 at UC San Diego, conducting EDA on recipe and review data from food.com.
Members: Kurumi Kaneko, Candus Shi
Introduction
For this project we were provided two datasets from food.com. One was about recipes with numerical and textual data like minutes, number of steps/ingredients, and tags. The other dataset was about interactions which include ratings and reviews for specific recipes. Our analysis was focused on answering the question: What types of recipes tend to have the most calories?
This is an important question because food impacts everybody, and it is essential to become more aware of what foods to eat while maintaining healthy eating standards. Disclaimer: we do not intend our results to validate or promote any specific diets.
In the recipe dataset, there are 83,782 rows where each row represents one recipe. Relevant columns include:
name
: the name of the recipe,id
: a unique number representing the recipe,tags
: a list of tags chosen by the creator,nutrition
: a list of nutritional values in the form—[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV)] (PDV—Percent of Daily Value),n_steps
: the number of steps in the recipe,n_ingredients
: the number of ingredients used in the recipe, anddescription
: an explanation of the recipe provided by some creators.
In the interactions dataset, there are 731,927 rows where each row represents one review of a recipe. Relevant columns include:
recipe_id
: a unique number representing the recipe,rating
: a rating (1 to 5 stars) for the recipe given by the reviewer.
Cleaning and EDA
Data Cleaning Steps:
- We left merged the recipes and interactions datasets and filled all ratings of 0 with
np.nan
. This is appropriate to do because it is not necessarily the case that the actual review/rating was 0-stars (i.e. the worst rating possible), but the reviewer could be asking a question or state their rating in the review text; additionally, some reviewers stated that the site was not allowing them to click the stars (glitch in the system), and therefore was forced to not provide a rating. - Using the merged dataset, we found the average rating per recipe and added this to the recipes dataset. We then proceed to use the recipes dataset with the average rating column.
- We added all of the individual nutritional values as their own column using the values from the
nutrition
column. - We identified most of the numerical data type columns from the recipes dataset that were relevant to our analysis. These are:
name
,id
,minutes
,n_steps
,n_ingredients
,average_rating
,calories
,total_fat_%
,sugars_%
,sodium_%
,protein_%
,saturated_fat_%
, andtotal_carbohydrate_%
. We then proceed to use a dataframe consisting of these columns. - Since the
average_rating
column was a quantitative continuous variable, we decided to bin the values into 0.5-width intervals of ratings and treat it as a categorical variable in theavg_rating_bins
column. One limitation of binning average rating is that we lose data granularity so the following analysis may not produce the most accurate results.
Clean Dataset:
name | id | minutes | n_steps | n_ingredients | average_rating | calories | total_fat_% | sugars_% | sodium_% | protein_% | saturated_fat_% | total_carbohydrate_% | avg_rating_bins |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 brownies in the world best ever | 333281 | 40 | 10 | 9 | 4 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 | 4.0 - 4.5 |
1 in canada chocolate chip cookies | 453467 | 45 | 12 | 11 | 5 | 595.1 | 46 | 211 | 22 | 13 | 51 | 26 | 4.5 - 5.0 |
412 broccoli casserole | 306168 | 40 | 6 | 9 | 5 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 4.5 - 5.0 |
millionaire pound cake | 286009 | 120 | 7 | 7 | 5 | 878.3 | 63 | 326 | 13 | 20 | 123 | 39 | 4.5 - 5.0 |
2000 meatloaf | 475785 | 90 | 17 | 13 | 5 | 267 | 30 | 12 | 12 | 29 | 48 | 2 | 4.5 - 5.0 |
Univariate Analysis
This is a histogram showing the proportion of recipes with a certain range of calories. Each bin is of width 50 calories. The graph is heavily skewed right, due to huge outliers, such as a recipe with ~45,000 calories (wow!). Most of the recipes have 0 to 2,500 calories.
The histogram above shows the bulk of the data (0 to 2,500 calories) in a more visually appealing manner.
This is a bar chart showing the frequency of recipes with an average rating between a certain range. Most recipes are decent; they are on average rated between 4.0 and 5.0 stars.
Bivariate Analysis
This is a group of box plots showing the distribution of calories for each average rating bin. Observe that there are many outliers especially for higher ratings, but most of the bins have a similar distribution.
The box plot above shows the bulk of the data (0 to 1,500 calories) in a more visually appealing manner. The interquartile range of all bins is around 150 calories to 500 calories. Note that there is a lot more data for 4.0-4.5 and 4.5-5.0 bins as shown in the bar chart in the above section. This means that there is an uneven distribution of recipes among the average rating bins, so it may not be correct to compare the box plots equally on their own.
This is an overlaid histogram showing the distribution of recipes with the low-calorie
tag compared to recipes without the low-calorie
tag. As one might expect, the distribution of the low-calorie
distribution is tighter around its center (i.e. has a lower variance) compared to the distribution of recipes without the tag.
The histogram above shows the bulk of the data (0 to 1,500 calories) in a more visually appealing manner.
Interesting Aggregates
Below is a pivot table of median number of calories per number of steps and average rating bins.
avg_rating_bins | 1-5 steps | 6-10 steps | 11-15 steps | 16-20 steps | 20+ steps |
---|---|---|---|---|---|
1.0 - 1.5 | 178.2 | 258.1 | 303.35 | 404.3 | 481.5 |
1.5 - 2.0 | 119.45 | 305.65 | 338.5 | 559.6 | 411.55 |
2.0 - 2.5 | 267.6 | 282.9 | 368 | 410.85 | 407.5 |
2.5 - 3.0 | 174.3 | 280.6 | 315.6 | 321.65 | 336.75 |
3.0 - 3.5 | 245.5 | 305.2 | 348.6 | 382.3 | 451.7 |
3.5 - 4.0 | 248.9 | 306.4 | 332.9 | 364.4 | 459 |
4.0 - 4.5 | 235.6 | 299.2 | 359.3 | 388.3 | 440.85 |
4.5 - 5.0 | 215.5 | 293.6 | 348.5 | 386.3 | 444.6 |
There is a general trend in an increase of median calories as the number of steps increases. This indicates that recipes with more steps seem to have higher calories according to this dataset. Among each interval of steps, the average rating does not vary much in median of calories, but for each average rating bin (except for 1.5-2.0) there is an increase in median calories with an increase in steps. This is more easily visualized in the bar chart below.
Assessment of Missingness
NMAR
We believe that the average_rating
column is NMAR. The values of average_rating
may be missing if any rating for a recipe was 0 (which we then replaced with np.nan
in the data cleaning steps) as this would affect the computation of the average because taking the average of values where at least one value is nan results in an average that is also nan. Thus, it depends on the individual ratings itself which in turn makes the missingness of average_rating
dependent on its values. Hence, it is NMAR.
The average_rating
column could become MAR if we had included the rating
column from the interactions dataset.
MAR/MCAR
The column with the most missing values was the average_rating
column (2,609 missing values!). We determined that it was MAR on calories
and MCAR on sodium_%
.
MAR on calories
- \(H_0\): The missingness of
average_rating
does not depend oncalories
. - \(H_1\): The missingness of
average_rating
does depend oncalories
. - test statistic: absolute difference in means
- significance level: \(\alpha = 0.01\)
- conclusion: since the p-value \(= 0.0 < \alpha = 0.01\), we reject the \(H_0\) and say that it seems like the missingness of
average_rating
depends oncalories
.
Given our dataset, the chance of getting the observed statistic or a more extreme absolute difference in means under the null hypothesis is statistically significant. Thus, the number of calories in a recipe affects the missingness of average_rating
.
MCAR on sodium_%
- \(H_0\): The missingness of
average_rating
does not depend onsodium_%
. - \(H_1\): The missingness of
average_rating
does depend onsodium_%
. - test statistic: absolute difference in means
- significance level: \(\alpha = 0.01\)
- conclusion: since the p-value \(= 0.894 > \alpha = 0.01\), we fail to reject the \(H_0\) and say that it seems like the missingness of
average_rating
does not depend onsodium_%
.
Given our dataset, the chance of getting the observed statistic or a more extreme absolute difference in means under the null hypothesis is not statistically significant. This test does not provide any useful information about our question.
Hypothesis Testing
We conducted a permutation test on n_ingredients
(categorized by “many ingredients”) and calories
. We define “many ingredients” to be n_ingredients
> 10. We are doing this test to see whether the number of ingredients in a recipe can tell us if such recipes tend to have higher calories. We chose 10 ingredients as our threshold by looking at the distribution of number of ingredients (in the histogram below) and eye-balling an approximate balancing point.
- \(H_0\): The distribution of calories for recipes with 10 or fewer ingredients is the same as the distribution of calories for recipes with more than 10 ingredients.
- \(H_1\): The distribution of calories for recipes with 10 or fewer ingredients is different from the distribution of calories for recipes with more than 10 ingredients. Specifically, recipes with more than 10 ingredients have more calories than recipes with 10 or fewer ingredients.
- test statistic: difference in group medians
- significance level: \(\alpha = 0.01\)
- conclusion: since the p-value \(= 0.0 < \alpha = 0.01\), we reject the \(H_0\) and say that it seems like the distribution of calories for recipes with 10 or fewer ingredients is different from the distribution of calories for recipes with more than 10 ingredients.
Given our dataset, the chance of getting the observed statistic or a more extreme difference in group medians under the null hypothesis is statistically significant. We can also say that recipes with more than 10 ingredients have more calories than recipes with 10 or fewer ingredients. Thus, to answer our main question, recipes with higher calories may have a higher number of ingredients.
Honorable mention:
name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | ingredients | n_ingredients | average_rating |
---|---|---|---|---|---|---|---|---|---|---|---|
how to preserve a husband | 447963 | 1051200 | 576273 | 2011-02-01 | [‘time-to-make’, …, ‘for-1-or-2’, …, ‘low-in-something’, …] | [407.4, 57.0, 50.0, 1.0, 7.0, 115.0, 5.0] | 9 | […, “don’t choose too young”, …, ‘keep warm with a steady fire of domestic devotion and serve with peaches and cream’, …] | [‘cream’, ‘peach’] | 2 | 5 |