
Author: Ishaan Chadha


Introduction

Our data was downloaded from food.com as two datasets: one containing recipes and their specific attributes, and another containing ratings and reviews of those recipes. We merged these datasets into one in order to answer the following question: Is there a relationship between the number of steps in a recipe and the average rating of that recipe?

This is an important question in the world of cooking. By looking at recipes and their respective average ratings, we can gain insight into whether the number of steps in a recipe factors into its rating. This could influence cooks to adjust the number of steps in their new recipes in the hope of earning higher ratings from reviewers.

After cleaning the data, this dataset has 234,429 rows and 26 columns (234,429 rows and 18 columns prior to cleaning the data). In order to get familiar with the dataset, here are some important columns that will be mentioned:

- id: the unique identifier of a recipe
- name: the name of the recipe
- minutes: the total time, in minutes, the recipe takes to make
- n_steps: the number of steps in the recipe
- n_ingredients: the number of ingredients in the recipe
- description: the author's description of the recipe
- calories (#): the number of calories in the recipe
- user_id: the identifier of a reviewer
- rating: the rating (1 to 5) a reviewer gave the recipe
- review: the text review left by the reviewer (if any)
- average_rating: the average of all ratings given to the recipe


Data Cleaning

We followed a series of data cleaning steps to make the analysis of the relationship between the number of steps in a recipe and its average rating easier and more accurate. Overall, cleaning the data in this way made the subsequent analysis both simpler and more reliable.
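As a rough illustration of the merging step described in the introduction, here is a minimal pandas sketch; the file names and the 'recipe_id' join key are assumptions, not necessarily the exact code used:

```python
import pandas as pd

# Hypothetical file names for the two food.com datasets described above.
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')

# Left-merge so every recipe is kept, even recipes with no reviews yet.
merged = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')

# Average rating per recipe, broadcast back onto every review row.
merged['average_rating'] = merged.groupby('id')['rating'].transform('mean')
```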


Univariate Analysis

In this analysis, we examine the box plot of the “n_steps” column, which provides insight into the distribution of step counts across recipes. The box itself shows that the middle 50% of recipes have roughly 6 to 13 steps, and the box plot’s median line, representing the middle value of the dataset, shows that recipes typically have a step count of 9. A lower fence of 1 and an upper fence of 23 indicate that the step count for almost all recipes lies between 1 and 23. However, outliers are observed with step counts reaching 100, suggesting the presence of a few recipes with significantly more complex procedures.
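A minimal sketch of how such a box plot could be produced with plotly.express, assuming the merged dataframe `merged` from the earlier sketch:

```python
import plotly.express as px

# Horizontal box plot of the number of steps per recipe.
fig = px.box(merged, x='n_steps', title='Distribution of n_steps Across Recipes')
fig.show()
```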


Bivariate Analysis

The plot above is a scatter plot of n_steps versus average_rating (duplicate id rows were dropped, since n_steps and average_rating are identical for every row of the same recipe). From the scatter plot, we can see a general trend: recipes with more steps have a lower proportion of low average_rating values than recipes with fewer steps, and correspondingly a higher proportion of high average_rating values. This suggests, although further analysis is needed to confirm it, that recipes with more steps are, on average, more likely to have a higher average_rating.
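A minimal sketch of this plot, under the same assumptions as above:

```python
import plotly.express as px

# Duplicate id rows are dropped, since n_steps and average_rating repeat
# for every review of the same recipe.
unique_recipes = merged.drop_duplicates(subset='id')
fig = px.scatter(unique_recipes, x='n_steps', y='average_rating',
                 title='n_steps vs. average_rating')
fig.show()
```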


Data Aggregate

The pivot table above shows the distribution of average ratings (binned in intervals of 1 to better visualize any trends) for each n_steps value in the dataframe. This is interesting because it lets one spot any possible relationship between the two variables that is worth exploring further.

Sidenote: The column for 71 steps is all 0’s, which is an outlier, as the sum of every column should be 1; this is because the only recipe with 71 steps had no reviews, so we replaced the np.NaN values with 0’s.
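A minimal sketch of how such a pivot table could be built, assuming the merged dataframe `merged`; the exact implementation may differ:

```python
import pandas as pd

# One row per recipe; bin average ratings into width-1 intervals.
recipes_only = merged.drop_duplicates(subset='id').copy()
recipes_only['rating_bin'] = pd.cut(recipes_only['average_rating'],
                                    bins=[0, 1, 2, 3, 4, 5])

# Count recipes per (rating bin, n_steps) cell, then normalize each
# n_steps column so that it sums to 1.
counts = (recipes_only
          .groupby(['rating_bin', 'n_steps'], observed=False)
          .size()
          .unstack('n_steps', fill_value=0))
pivot = (counts / counts.sum()).fillna(0)  # an all-NaN column (e.g. 71 steps) becomes 0s
```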


NMAR Analysis

The missingness in the review column could potentially be classified as NMAR, on the premise that a reviewer might simply have been too lazy to write out an explanation for their rating of the recipe.

Reasoning: Depending on how the reviewer felt when filling out their review, they could have been too lazy, too tired, or may simply have forgotten to fill out the review column. This would lead to an np.NaN value for their review of the recipe. Since the missingness would then depend on the reviewer’s own unrecorded behavior rather than on any observed column, this would indicate that the review column is NMAR.

On the other hand, review could be shown to be MAR if more data were collected. With additional data, we could potentially show that this column’s missingness depends on the following columns in their respective scenarios:

  1. user_id: In this scenario, we would get more reviews from the same reviewers. If the same users repeatedly neglect to leave a review when rating a recipe, we could label review as MAR, with its missingness dependent on user_id.

  2. rating: In this scenario, we would get more reviews in general, from any user_id, new or already in the dataframe. Depending on someone’s emotional state when filling out a review (enraged, neutral, or happy), they might or might not leave the review blank. Someone very dissatisfied with the recipe could leave a 1-star rating and no review, feeling that the rating alone conveys their feelings; someone content or neutral could leave 3 stars and no review, having nothing to add beyond the rating; and someone happy with the recipe could leave 5 stars and no review, feeling that the stars convey their satisfaction. Therefore, if a trend between the two columns is present, we could label review as MAR, with its missingness dependent on rating.

  3. id: In this scenario, we would get more reviews for the same recipes (id) in the dataset. Perhaps more common recipes, like a grilled cheese, do not elicit a review because reviewers feel the recipe is too common to need anything beyond a rating. Therefore, we could label review as MAR, with its missingness dependent on id.

By incorporating these additional data and strategies, it may be possible to explore alternative explanations for the missingness in the review column and evaluate whether the missingness can be considered MAR rather than NMAR. This deeper understanding can guide appropriate data handling techniques and mitigate potential biases caused by missing values during subsequent analysis.


Missingness Dependency Analysis

We are exploring whether the missingness of description is MCAR or MAR. We used the following permutation tests to help us decide:

First Test

Null hypothesis: The missingness of the description column does not depend on the n_steps column.

Alternative hypothesis: The missingness of the description column depends on the n_steps column, indicating that description is MAR.

Test Statistic: Absolute difference in the mean of n_steps between rows with a description and rows with a missing description.

alpha = 0.05

P-value = 0.207

Conclusion: Because our p-value of 0.207 is greater than the alpha level of 0.05, we fail to reject the null hypothesis that the missingness of the description column does not depend on the n_steps column. There is no statistically significant evidence that the missingness of description depends on n_steps.

Now, let us look at whether the missingness of description is dependent on the average_rating column.

Second Test

Null hypothesis: The missingness of the description column does not depend on the average_rating column.

Alternative hypothesis: The missingness of the description column depends on the average_rating column, indicating that description is MAR.

Test Statistic: Absolute difference in the mean of average_rating between rows with a description and rows with a missing description.

alpha = 0.05

P-value = 0.006

Conclusion: Because our p-value of 0.006 is less than the alpha level of 0.05, we reject the null hypothesis that the missingness of the description column does not depend on the average_rating column. There is statistically significant evidence that the missingness of description depends on average_rating. This indicates that description is MAR, as its missingness is dependent on average_rating.
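For concreteness, here is a minimal sketch of the permutation test used in both cases, assuming the merged dataframe is named `merged`; the exact implementation may differ:

```python
import numpy as np

def missingness_perm_test(df, missing_col, value_col, n_perms=1000):
    """Permutation test: does the missingness of `missing_col` depend on
    `value_col`?  Test statistic: absolute difference in the mean of
    `value_col` between rows missing `missing_col` and rows that have it."""
    is_missing = df[missing_col].isna().to_numpy()
    vals = df[value_col].to_numpy(dtype=float)

    def stat(mask):
        return abs(np.nanmean(vals[mask]) - np.nanmean(vals[~mask]))

    observed = stat(is_missing)
    rng = np.random.default_rng()
    perm_stats = np.array([stat(rng.permutation(is_missing)) for _ in range(n_perms)])
    p_value = np.mean(perm_stats >= observed)
    return observed, p_value

# e.g. missingness_perm_test(merged, 'description', 'n_steps')
#      missingness_perm_test(merged, 'description', 'average_rating')
```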


Hypothesis Testing

As stated previously, we chose to explore the relationship between the number of steps in a recipe and the average rating of that recipe. To test this we set up the following permutation test:

We define short recipes as recipes with fewer than 9 steps and long recipes as those with 9 or more steps. This value was chosen carefully so as to split the number of recipes in each category evenly (roughly 50% of the data in each category).

Null hypothesis: There is no observable relationship between the number of steps in a recipe and the average rating of that recipe.

Alternative hypothesis: There is a relationship between the number of steps in a recipe and the average rating of that recipe.

Test Statistic: Absolute difference in mean of average_rating between short and long recipes.

alpha = 0.05

P-value = 0.001

Conclusion: Because p-value = 0.001 < alpha level = 0.05, we reject the null hypothesis that there is no observable relationship between the number of steps in a recipe and the average rating of that recipe. There is statistically significant evidence that there is a relationship between the number of steps in a recipe and the average rating of that recipe. Further testing should be done to determine the strength and direction of this relationship.

Reasoning: In order to test our question, we needed to run a permutation test, as we do not have both a population and a sample from that population; we only have two samples (short and long recipes). To run the permutation test, we created two labels for the n_steps variable. As mentioned before, we chose 9 steps in order to split the number of recipes in each category evenly (roughly 50% of the data in each category), which split the data for the permutation test without bias. We chose the null and alternative hypotheses above because the null hypothesis is what we operate under when running the test (there is no difference in the means of average_rating for recipes of different lengths), whereas the alternative hypothesis is the alternative explanation if the results disprove the null hypothesis (there is a difference present). Because we are comparing two values (the mean average_rating for short and long recipes), it is natural for our test statistic to be the absolute value of the difference between these two means. We chose an alpha level of 0.05, as this is a common alpha level for statistical tests. And because our p-value was less than the alpha level, we rejected the null hypothesis and concluded that we had statistically significant evidence for the alternative hypothesis.
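A minimal sketch of this test, under the same assumptions as the earlier permutation-test sketch (one row per recipe after dropping duplicate ids):

```python
import numpy as np

# Label recipes as long (>= 9 steps) or short (< 9 steps).
recipes_only = merged.drop_duplicates(subset='id')
is_long = (recipes_only['n_steps'] >= 9).to_numpy()
ratings = recipes_only['average_rating'].to_numpy(dtype=float)

def abs_mean_diff(long_mask):
    return abs(np.nanmean(ratings[long_mask]) - np.nanmean(ratings[~long_mask]))

observed = abs_mean_diff(is_long)
rng = np.random.default_rng()
perm_stats = np.array([abs_mean_diff(rng.permutation(is_long)) for _ in range(1000)])
p_value = np.mean(perm_stats >= observed)
```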


Framing the Problem

From the Recipes and Ratings dataframes, we decided to investigate the following prediction problem: can we predict the rating a reviewer gives a recipe based on the other columns in the merged dataframe? To predict the response variable rating, an integer from 1 to 5, we chose a classification model in the form of a DecisionTreeClassifier with multiclass classification. We chose to predict rating in order to gain some insight into what influences a rating, from the aspects of the recipe to the reviewer. To evaluate the model, we will use overall accuracy to see how the model performs in aggregate, as well as accuracy for each of the response variable’s values to detect any bias in the predictions made. We chose accuracy over other metrics because there is no intrinsically worse error to commit (a false negative and a false positive are equally bad), which makes precision and recall unsuitable on their own; we simply want the model to make as many accurate predictions as possible, which is exactly what accuracy measures. Furthermore, since we do not care about precision and recall as much as the overall success of the model, the F1-score is not especially suitable either, which led us to choose accuracy as our evaluation metric. We will also look at the model’s accuracy for each of the response variable’s values to see whether our model is biased towards guessing some ratings over others.

At the time of predicting the rating for a particular recipe and reviewer, we have access to information about the recipe itself: its id, its description (of the overall recipe, its steps, and its ingredients), its time to make, its number of steps, and its number of ingredients. As for the review, we can only access the reviewer’s identification (user_id) at prediction time. This is because reviewers usually register before rating a recipe or access a website to input their reviews, which is tied to their user_id. All of these predictor variables will help us predict the rating given by the reviewer for that particular recipe.


Baseline Model

In our Baseline Model, we try to predict the ratings that different recipes on food.com are given.

In our model, we use a pipeline that consists of a preprocessing step followed by a decision tree classifier. The preprocessing step transforms the features before feeding them into the decision tree classifier. The preprocessor is a ColumnTransformer, which applies different transformations to different columns of the dataset. We standard-scale the calories (#) column and apply a FunctionTransformer to the n_ingredients column. Since n_ingredients is an ordinal column, one-hot encoding it would throw away its ordering; instead, our FunctionTransformer encodes it ordinally, categorizing the number of ingredients into three categories: 0 for fewer than 7 ingredients, 1 for 7-10 ingredients, and 2 for more than 10 ingredients. We chose these splits because they divide the recipes evenly into three groups (each with about 33% of the dataset). The remaining columns are dropped and thus are not factored in when predicting the rating column.

Features:

1. Quantitative feature: calories (#) (one feature). It represents the number of calories in a recipe and is standardized (scaled) using the StandardScaler.

2. Ordinal feature: n_ingredients (one feature). The number of ingredients is categorized into three groups: 0 for fewer than 7 ingredients, 1 for 7-10 ingredients, and 2 for more than 10 ingredients. It is ordinally encoded using the FunctionTransformer.

The data is split randomly into a training set (80%) and a test set (20%) so that we evaluate the model on data it has not seen, which guards against rewarding an overfit model.

The performance of the model is evaluated using accuracy, which is calculated by the score method of the pipeline on the test set. The accuracy represents the proportion of correctly predicted ratings out of all the predictions made. We get an accuracy of approximately 74% when we run this model on our test dataset.
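Putting the above together, a minimal sketch of the baseline pipeline, assuming the merged dataframe `merged`; `bucket_ingredients` is a helper written for this sketch:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier

def bucket_ingredients(X):
    # 0: fewer than 7 ingredients, 1: 7-10 ingredients, 2: more than 10.
    X = np.asarray(X)
    return np.where(X < 7, 0, np.where(X <= 10, 1, 2))

preprocessor = ColumnTransformer([
    ('calories', StandardScaler(), ['calories (#)']),
    ('ingredients', FunctionTransformer(bucket_ingredients), ['n_ingredients']),
], remainder='drop')  # every other column is dropped

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('tree', DecisionTreeClassifier()),
])

# Rows with no rating have nothing to predict, so they are dropped here.
data = merged.dropna(subset=['rating'])
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns='rating'), data['rating'], test_size=0.2)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # test-set accuracy
```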

We believe that our model isn’t up to the mark because it is heavily biased towards predicting ‘5’ as the rating of a recipe. Our model overpredicts 5’s and underpredicts all the other ratings (because a rating of 5 is so common in the dataset). We believe that adding features that help distinguish a rating of ‘5’ from all the other ratings would greatly help our model predict other ratings and increase its accuracy.

Additionally, we need to improve upon the accuracy of the overall model. A good baseline to compare our model’s accuracy against is the rate of 5’s in the dataset, since 5 is by far the most common rating; our baseline model has an accuracy of 74%, whereas roughly 77% of the ratings in the test set are 5’s, so always predicting 5 would beat our model. Therefore, our model is not good right now, but we can greatly improve its accuracy (and thus its overall effectiveness) by adding more valuable predictor variables.


Final Model

The features we added to the model are:

1. ‘Nutrition Data’: We used a StandardScaler to standardize nutritional data such as ‘calories’, ‘total fat’, ‘sodium’, and ‘sugar’, which we then use to predict the rating of the recipe. We believe these are important because reviewers who care about nutrition might give recipes with high calories or high fat a lower rating.

2. ‘n_steps’: We used a Binarizer to convert this feature into a binary variable based on a threshold of 8. This feature represents the number of steps required to prepare the recipe. By converting it into a binary variable, the model can capture whether a recipe has a relatively small or large number of steps, which may reflect the cooking time or complexity of the recipe. We feel that the number of steps matters because a recipe with too many steps might exhaust reviewers and lead to a bad review.

3. ‘minutes’: We used a Binarizer to convert this feature into a binary variable based on a threshold of 100. This feature represents the total cooking time in minutes. By converting it into a binary variable, the model can differentiate between recipes with shorter and longer cooking times, which can be useful in capturing potential correlations between cooking time and the target variable. We feel that cooking time matters because a recipe that takes too long to make might feel overcomplicated, which could negatively impact the rating given by the reviewer.

4. ‘user_id’: We used a Binarizer to convert this feature into a binary variable based on a threshold of 85 reviews; this separates reviewers by experience (their prevalence in the dataframe). We chose 85 because it splits the total reviews handed out into two even groups. Thus, this feature represents reviewer experience. By converting it into a binary variable, the model can differentiate between recipes reviewed by less experienced reviewers and those reviewed by more experienced ones, which can be useful in capturing potential correlations between reviewer experience and the target variable. We feel that reviewer experience matters because a reviewer with more experience could tend to hand out more generous ratings.

5. ‘n_ingredients’: We applied a custom transformation using a FunctionTransformer. This feature represents the number of ingredients required for the recipe. By transforming this feature into three equal-size categories, we can extract additional information, such as whether a recipe has a small or large number of ingredients. We believe that, as with minutes and n_steps, more complex recipes could negatively impact ratings, with a higher ingredient count serving as a proxy for complexity.

6. Categorical Features: We used a OneHotEncoder to encode the common words in the name column of the dataset. We found that certain very frequent words in names could correlate strongly with the rating column, such as sweet treats like ‘brownies’ and ‘cookies’ and healthier recipes like ‘salad’ and ‘fruits’, among others. We added these common words as columns to our dataframe, with ‘0’ indicating that the word is not present in the recipe name and ‘1’ indicating that it is. By one-hot encoding these variables, the model can capture the relationships between different words and their impact on the rating of the recipe. This can be relevant because certain kinds of recipes, like desserts, could have higher or lower ratings than healthier recipes like salads and those with fruits. (A hedged sketch of this full feature set follows this list.)
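A sketch of the final preprocessor under stated assumptions: the column names follow the write-up, `bucket_ingredients` comes from the baseline sketch, ‘user_review_count’ is a hypothetical derived column (number of reviews per user_id), and the word-indicator columns (e.g. ‘has_brownies’) are assumed to be precomputed 0/1 flags extracted from the recipe name.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (Binarizer, FunctionTransformer,
                                   OneHotEncoder, StandardScaler)

final_preprocessor = ColumnTransformer([
    ('nutrition', StandardScaler(),
     ['calories (#)', 'total fat', 'sodium', 'sugar']),
    ('steps', Binarizer(threshold=8), ['n_steps']),
    ('minutes', Binarizer(threshold=100), ['minutes']),
    ('experience', Binarizer(threshold=85), ['user_review_count']),
    ('ingredients', FunctionTransformer(bucket_ingredients), ['n_ingredients']),
    ('name_words', OneHotEncoder(handle_unknown='ignore'),
     ['has_brownies', 'has_cookies', 'has_salad', 'has_fruits']),
], remainder='drop')
```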

The modeling algorithm we chose for this task is the Decision Tree Classifier. Decision trees are suitable for both classification tasks and handling a mix of continuous and categorical features. They can capture complex interactions and non-linear relationships in the data.

To select the best hyperparameters, we used the GridSearchCV function, which performs an exhaustive search over the specified parameter values using cross-validation. The model’s performance was evaluated based on accuracy, which measures the proportion of correct predictions. The hyperparameters that were tuned using grid search are:

  1. max_depth: This parameter determines the maximum depth of the decision tree. It controls the complexity of the model and helps prevent overfitting. The values tested were 10, 15, 25, 30, and 50; we chose these candidates after observing which values of max_depth were actually being selected in the best fits of the model.

  2. min_samples_split: This parameter specifies the minimum number of samples required to split an internal node. It affects the decision tree’s ability to capture fine-grained patterns in the data. The values tested were 100, 200, 250, 275, and 300; we chose these candidates after observing which values of min_samples_split were actually being selected in the best fits of the model.

  3. criterion: This parameter defines the quality measure used to evaluate the splits in the decision tree. The two options tested are ‘gini’ and ‘entropy’. ‘Gini’ measures the impurity of a node, while ‘entropy’ calculates the information gain.
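A minimal sketch of the search, assuming `final_preprocessor` from the sketch above and the same train/test split as the baseline:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

final_pipeline = Pipeline([
    ('preprocess', final_preprocessor),
    ('tree', DecisionTreeClassifier()),
])

# Exhaustive search over the hyperparameter grid described above.
param_grid = {
    'tree__max_depth': [10, 15, 25, 30, 50],
    'tree__min_samples_split': [100, 200, 250, 275, 300],
    'tree__criterion': ['gini', 'entropy'],
}

search = GridSearchCV(final_pipeline, param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)          # best hyperparameter combination
print(search.score(X_test, y_test)) # test-set accuracy of the tuned model
```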

The hyperparameters that gave the best results were:

Compared to the baseline model, the final model includes additional features and a more carefully tuned version of the same algorithm. By incorporating information about the number of steps, cooking time, and number of ingredients, as well as one-hot encoding the categorical word features from the names, the model can capture more nuanced patterns in the data. The decision tree classifier can also handle non-linear relationships and interactions between features. The grid search helped in finding the best hyperparameters, resulting in a model that is fine-tuned for the data. Overall, the final model’s performance shows a significant improvement over the baseline model in overall accuracy due to the inclusion of these additional features and the tuning of the algorithm. However, this comes at the cost of a bias towards guessing 5’s more often; this is due to the overall frequency of 5’s in the dataset, which can cause other trends to be overlooked.


Fairness Analysis

We decided to look into whether our model performs differently based on the number of minutes a recipe takes. To investigate this, we set up the following permutation test:

Group 1: Short recipes: recipes that take less than 35 minutes to make

Group 2: Long recipes: recipes that take 35 minutes or more to make

We chose 35 minutes as the divider between the two groups as this is the value that most evenly splits up the dataset into the groups (roughly 50% of the dataframe to_work is in each group).

Null hypothesis: Our model is fair; its accuracy for short recipes and long recipes is roughly the same, and any difference is due to random chance.

Alternative hypothesis: Our model is unfair; its accuracy for short recipes is not equal to its accuracy for long recipes.

Evaluation Metric: Accuracy

Test Statistic: abs((Accuracy of recipes that take less than 35 minutes) - (Accuracy of recipes that take 35 minutes or more))

Significance Level: alpha = 0.05

Observed test statistic: 0.0143

p-value = 0.00

Because our p-value of 0.00 is less than the significance level of alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that our model could be unfair; specifically, its accuracy for recipes that take less than 35 minutes may not equal its accuracy for recipes that take 35 minutes or more. It should be noted that, without randomized controlled trials, we cannot draw definitive conclusions about the truth or falsehood of either hypothesis.
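A minimal sketch of this fairness check, assuming the fitted `search` from the grid-search sketch and the same held-out test set (whose ‘minutes’ column is still accessible):

```python
import numpy as np

# Per-row correctness of the tuned model on the test set.
preds = search.predict(X_test)
correct = (preds == y_test.to_numpy())
is_short = (X_test['minutes'] < 35).to_numpy()

def accuracy_gap(short_mask):
    # Absolute difference in accuracy between short and long recipes.
    return abs(correct[short_mask].mean() - correct[~short_mask].mean())

observed = accuracy_gap(is_short)
rng = np.random.default_rng()
perm_stats = np.array([accuracy_gap(rng.permutation(is_short)) for _ in range(1000)])
p_value = np.mean(perm_stats >= observed)
```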