Introduction
The aim of this project is to build a machine learning model that can
predict the user rating of beauty products sold at the Sephora online
store. Sephora, a popular retailer of personal and beauty products,
launched its online store in 1999 and sells items such as cosmetics,
skincare, hair care, and fragrances.
Imagine scrolling through hundreds of products on the Sephora
website, each offering different advantages at contrasting price
points. When it comes to beauty and skincare, accurate insight into
whether a product suits your individual interests and skin type is
invaluable. “Which of these is right for me? Which product is truly
worth spending on?” These are questions I often ask myself when
looking for a new product to purchase, with Sephora or any other
online retailer.
Fortunately for shoppers like myself, knowing that consumers with
similar product preferences have already tried and purchased an item
makes all the difference in making the right purchasing decision. In
moments like these, user ratings and reviews provide reassuring
guidance toward the best choice.
I noticed that a product rating on the Sephora website is displayed
as five stars filled on a continuous scale from 1 (lowest) to 5
(highest). I started to wonder, “What exactly drives a user to
give a product a high or low rating?” I realized there are numerous
factors beyond the product’s performance that could impact this,
such as its brand, its popularity, and even whether or not it is an
exclusive item.
I would like to predict how users determine a rating based on
these characteristics and to uncover trends or patterns in Sephora
product ratings. By understanding how product factors contribute to high
or low ratings, we can better understand shopper decision
making.
The dataset I am using for this project, “Sephora Products and
Skincare Reviews,” was found on Kaggle. It consists of information on
over 8,000 products from the Sephora online store and was scraped by
Nady Inky using a Python scraper in March 2023.
Project Outline
First, we will load the packages necessary for the project and the
products dataset we are using. To clean up the data, we will tidy any
missing observations. The data will then be ready to explore through
EDA (Exploratory Data Analysis), where we analyze the relationships
between our predictor variables and our response variable, rating.
We can then build our recipe using the variables we are interested in
and split our data into training and testing sets. Before we build
any models, it is useful to perform k-fold cross-validation on our
training data. Finally, we can set up our 6 models: Linear
Regression, K-Nearest Neighbors, Elastic Net, Decision Tree, Random
Forest, and Boosted Trees. After applying these models to our folds, we
will determine the best performing model and its parameters based on the
Root Mean Squared Error (RMSE), fit it to the training data, and
evaluate it on the testing data. Once this is completed, we can see how
well our model performs in predicting the rating of Sephora products!
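Since RMSE will be the metric used to compare models throughout this project, here is a minimal sketch of how it is computed. The true and predicted ratings below are made up purely for illustration; in the project itself, yardstick computes this for us.

```r
# RMSE: the square root of the mean squared difference between
# predicted and true values. Lower is better.
truth <- c(4.2, 3.6, 4.8, 2.9)  # hypothetical true ratings
pred  <- c(4.0, 3.9, 4.5, 3.1)  # hypothetical model predictions

rmse_manual <- sqrt(mean((truth - pred)^2))
rmse_manual  # sqrt(0.065), approximately 0.255
```

yardstick’s rmse_vec(truth, pred) returns the same quantity, which is what tune_grid() reports for each fold later on.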
Loading Packages and Data
First, let’s load our packages and the raw Sephora products data.
# loading all packages
library(tidyverse)
library(dplyr)
library(tidymodels)
library(kknn)
library(corrplot)
library(ggplot2)
library(visdat)
library(finalfit)
library(kableExtra)
library(naniar)
library(forcats)
library(pals)
library(caret)
library(vip)
library(yardstick)
library(ranger)
library(xgboost)
library(wesanderson)
# loading data
products <- read_csv("data/product_info.csv")
tidymodels_prefer()
Exploring and Tidying the Raw Data
Now we can start exploring the dataset and its values.
# determining rows and columns number
dim(products)
## [1] 8494 27
There are 8494 observations in the dataset with 27 variables. Let’s
take a look at whether the data has missing information to determine how
to handle this.
# view the summary for missing values
products %>%
miss_var_summary() %>%
kable() %>%
kable_material(c("striped", "hover","condensed")) %>%
scroll_box(width = "100%", height = "350px")
variable               n_miss   pct_miss
sale_price_usd           8224    96.8
value_price_usd          8043    94.7
variation_desc           7244    85.3
child_max_price          5740    67.6
child_min_price          5740    67.6
highlights               2207    26.0
size                     1631    19.2
variation_value          1598    18.8
variation_type           1444    17.0
tertiary_category         990    11.7
ingredients               945    11.1
rating                    278     3.27
reviews                   278     3.27
secondary_category          8     0.0942
product_id                  0     0
product_name                0     0
brand_id                    0     0
brand_name                  0     0
loves_count                 0     0
price_usd                   0     0
limited_edition             0     0
new                         0     0
online_only                 0     0
out_of_stock                0     0
sephora_exclusive           0     0
primary_category            0     0
child_count                 0     0
vis_miss(products)
Our dataset has quite a few missing values! There is missing data
in 14 of the 27 variables, including reviews, size, variation_type,
variation_value, variation_desc, ingredients, value_price_usd,
sale_price_usd, highlights, secondary_category, tertiary_category,
child_max_price, and child_min_price.
We should examine whether there is a pattern of missingness between
variables. What’s notable in the missing data plot is the pair
rating and reviews: 3.27% of the values for our response variable,
rating, are missing, and the same percentage is missing for reviews.
products %>%
count(across(c(reviews, rating), is.na))
While 19.3% of the data is missing, 80.7% is present, and we can
still use that for our EDA process. Most of the missing data comes from
the predictors sale_price_usd, value_price_usd, and variation_desc.
Let’s drop these variables, along with others that have a high
percentage of missing values or are irrelevant to the prediction.
# dropping predictors
column_drop <- c("sale_price_usd", "value_price_usd", "variation_desc", "child_max_price", "child_min_price", "highlights", "size", "variation_value", "variation_type", "tertiary_category", "ingredients", "child_count")
products <- products[, !(names(products) %in% column_drop)]
# dropping rows with missing observations
products <- products[complete.cases(products), ]
Let’s convert any categorical variables to factors. Then, let’s take
a look at our dataset and its dimensions!
products$primary_category <- as.factor(products$primary_category)
products$secondary_category <- as.factor(products$secondary_category)
# head() to see first rows
products %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "200px")
product_id  product_name                       brand_id  brand_name  loves_count  rating  reviews  price_usd  limited_edition  new  online_only  out_of_stock  sephora_exclusive  primary_category  secondary_category
P473671     Fragrance Discovery Set            6342      19-69       6320         3.6364  11       35         0                0    1            0             0                  Fragrance         Value & Gift Sets
P473668     La Habana Eau de Parfum            6342      19-69       3827         4.1538  13       195        0                0    1            0             0                  Fragrance         Women
P473662     Rainbow Bar Eau de Parfum          6342      19-69       3253         4.2500  16       195        0                0    1            0             0                  Fragrance         Women
P473660     Kasbah Eau de Parfum               6342      19-69       3018         4.4762  21       195        0                0    1            0             0                  Fragrance         Women
P473658     Purple Haze Eau de Parfum          6342      19-69       2691         3.2308  13       195        0                0    1            0             0                  Fragrance         Women
P473661     Kasbah Eau de Parfum Travel Spray  6342      19-69       2448         4.4762  21       30         0                0    1            0             0                  Fragrance         Women
dim(products)
## [1] 8209 15
Describing Predictors
These are 10 out of the original 27 variables that I have selected
from the Sephora products dataset which will be used to create the model
recipe. I decided to remove variables including brand_name
or product_name
so that our model can better generalize
predictions for new brands and products as well.
- loves_count: the number of people who have marked the product as a
‘favorite’ on the site
- reviews: the number of user reviews for the product
- limited_edition: binary value that indicates whether the product is
limited edition or not (1-true, 0-false)
- new: binary value that indicates whether the product is new or not
(1-true, 0-false)
- online_only: binary value that indicates whether the product is only
sold online or if it is also sold in store (1-true, 0-false)
- out_of_stock: binary value that indicates whether the product is
currently out of stock or not (1-true, 0-false)
- sephora_exclusive: binary value that indicates whether the product is
exclusive to Sephora or not (1-true, 0-false)
- primary_category: first category that the product falls under on the
site (ex. Fragrance, Bath & Body, Hair)
- secondary_category: second category that the product falls under on
the site (ex. Women, Hair Styling & Treatments, Sunscreen)
- price_usd: price of the product in US Dollars
Visual Exploratory Data Analysis
Distribution of Rating
We can first take a look at the distribution of our response
variable, rating.
# creating histogram of rating
ggplot(products, aes(rating)) +
geom_histogram(binwidth = 1, fill = "rosybrown3") +
ggtitle("Histogram of Sephora Product Ratings") +
xlab("Rating") + ylab("Frequency") +
stat_bin(binwidth=1, geom='text', color='black', aes(label=after_stat(count)),
position=position_stack(vjust = 0.5)) + theme_minimal()
The histogram indicates that the ratings for Sephora products follow
a left-skewed distribution. The most common rating bin is 4, covering
5119 of the 8209 products. The second most common is 5, with 2289
products.
Variable Correlation Plot
We can utilize a correlation matrix of our numeric variables to
determine if there are relationships between variables.
# selecting numeric values
var_numeric <- products %>%
select(c(price_usd, rating, loves_count, reviews)) %>%
cor() %>%
corrplot(type = 'lower', diag = TRUE,
method = 'color')
While there is a weak positive correlation between rating and
reviews for a particular product, there is nearly no correlation
between the number of reviews and price_usd. There is, however, also a
weak positive correlation between price_usd and rating. This surprised
me, since I expected a stronger positive correlation between reviews
and rating: a greater number of reviews may indicate high popularity
of a product and thus a higher rating. There is a moderately high
positive correlation between the number of reviews and loves_count.
However, loves_count has only a slightly positive relationship with
rating, which means it may not be a strong predictor of it.
Product Prices
Let’s look for more insight into the weak positive correlation
between price_usd and rating. I expect that products with lower prices
may have higher ratings because they are more affordable: cheaper
products can be perceived to offer good value for the price paid,
especially when they perform better than expected at that price point.
On the other hand, some expensive products can exceed consumer
expectations and also earn higher ratings. We can explore this
relationship in Sephora’s online products to see whether these
assumptions hold.
# creating scatterplot
ggplot(products, aes(x=price_usd, y=rating)) +
geom_point(alpha=0.3, color="rosybrown3") +
scale_x_continuous(breaks = pretty(products$price_usd, n = 10)) +
ggtitle("Scatterplot of Sephora Product Ratings and Price (USD)") +
xlab("Price (USD)") + ylab("Rating") + theme_minimal()
Wow, we can see that a lot of observations fall between 0 and 100
dollars. There are also many that fall between 200 and 400 dollars,
and a couple of these have low average ratings. We also notice one
observation at around $1900! That is very expensive compared to the
prices of all the other products, so we can consider it an outlier and
remove it from our products dataset. Removing outliers is an important
preprocessing step, since they can distort our model’s performance.
After we remove this observation, let’s take a closer look at the
relationship between price_usd and rating using another scatterplot.
# removing the outlier observation (price at or above $1900)
products <- products[products$price_usd < 1900, ]
# plotting new scatterplot
ggplot(products, aes(x=price_usd, y=rating)) +
geom_point(alpha =0.3, color="rosybrown3") +
geom_smooth(formula = y ~ x,method = "lm", se = FALSE, color = "black", lty = 1, linewidth = 0.5) +
ggtitle("Scatterplot of Product Ratings and Price (without Outliers)") +
xlab("Price (USD)") + ylab("Rating") + theme_minimal()
Now we can take a much closer look at how the price_usd of a
particular product relates to its rating. There is a modest positive
relationship between the price and the rating of a product. However,
it is important to note that lower-priced products also have fairly
high ratings, and that the majority of prices are on the lower end.
This indicates that the relationship between price and rating is not
as straightforward as we assumed, but we can still consider price an
important predictor of rating.
Sephora Exclusive Products
Sephora often has new products that are highly promoted on their
website and on social media. This promotion influences many consumers to
purchase, especially if that product is exclusive to Sephora. There is
even a section on their website that is dedicated to ‘Sephora Only’
brands, which includes higher end and celebrity brands (created by
celebrities like Rihanna, Selena Gomez, etc!).
To explore how sephora_exclusive
and products and
rating
may be related, I created a box plot that shows
their relationship.
# creating box plot
products %>%
ggplot(aes(x = factor(sephora_exclusive), y = rating, color = factor(sephora_exclusive))) +
geom_jitter(alpha = 0.2, width = 0.3, height = 0) + theme_minimal() +
labs(title = "Scatter Plot: Sephora Exclusive Products vs. Rating",
x = "Sephora Exclusive",
y = "Rating",
color = "Sephora Exclusive") +
scale_color_manual(values = c("0" = "rosybrown3", "1" = "seashell4"))
We can see that products exclusive to Sephora mostly have ratings
above 3, while more of the non-exclusive products fall below a rating
of 3. However, it is important to note that there are more points for
non-exclusive products than for exclusives. This implies that products
marked sephora_exclusive tend to have more ratings above 3, but they
make up a smaller share of the overall product count on the website.
Let’s keep this in mind as we continue our analysis.
Loves Count and Reviews
loves_count refers to the number of hearts a particular product has
on the Sephora website. Users select a product’s heart icon as a
‘love’ if it is one of their favorites. It’s a great way for shoppers
to save and collect their best products for easier checkout. Earlier,
we saw that there is a moderately positive correlation between the
number of loves and the number of reviews.
# creating scatterplot
ggplot(products, aes(x = loves_count, y = reviews)) +
geom_point(alpha = 0.2, color = "rosybrown3") +
geom_smooth(formula = y ~ x,method = "lm", se = FALSE, color = "black", lty = 1, linewidth = 0.5) +
labs(title = "Scatter Plot of Loves Count vs Reviews",
x = "Loves Count",
y = "Reviews") +
theme_minimal()
From this graph, we can see a positive relationship between reviews
and loves_count, confirming our original expectations. We can also
note that the majority of points are clustered below about 2,500
reviews and 250,000 loves. This again reinforces the idea that
although there is a positive correlation between these two variables,
it is not extremely strong.
New and Limited Edition Products
Typically, new products will have a higher rating, as early
consumers tend to be enthusiasts of that particular product or brand.
As I mentioned before, new products are promoted on the Sephora
Instagram account as well as through social media influencers. In
addition, some early reviewers could be receiving incentives for
trying the product, which would also lead to higher ratings. I decided
to use a boxplot to see whether this is true.
# comparing new and limited edition using boxplot
products %>%
ggplot(aes(x = rating, y = factor(new), fill = factor(limited_edition))) + geom_boxplot() +
labs(
title = "Box Plot of New, Limited Edition Sephora Products and Rating",
y = "New",
x = "Rating",
fill = "Limited Edition") + scale_fill_manual(values = c("rosybrown3", "seashell3")) + theme_minimal()
We can see that new products do have a slightly higher average
rating, around 4.5. If these products are also limited_edition,
however, their average rating is slightly lower. It is also important
to note that there are some outliers with low ratings (1 and 2) that
are both limited edition and new. New, limited edition products have a
larger interquartile range, indicating a wider distribution of scores.
As for non-new products, the average rating is very close whether or
not the product is limited edition. This implies that there may not be
a close relationship between these two variables and rating.
Category Comparisons
Sephora carries many types of products on its website, including a
range of fragrances for both men and women, makeup, skincare, hair
care, and tools. To better understand what types and how many
categories we are working with, I want to visualize the primary and
secondary categories, but first let’s see how many products there are
in each.
# bar chart of primary
ggplot(products, aes(y = fct_infreq(primary_category))) +
geom_bar(fill = "rosybrown3") +
labs(
title = "Count of Each Primary Category") + theme_minimal()
# bar chart of secondary
products %>%
ggplot(aes(y = fct_infreq(secondary_category))) +
geom_bar(fill = "seashell3") +
labs(
title = "Count of Each Secondary Category") + theme_minimal()
We can see there are 8 different primary categories that the
products fall under. There are significantly more products in
‘Skincare’, ‘Makeup’, ‘Hair’, and ‘Fragrance’ than in ‘Mini Size’,
‘Men’, and ‘Tools & Brushes’. Most of the products fall under either
“Skincare” or “Makeup”. We should prioritize the categories with a
higher count and group the others into a separate category.
We can also visualize the secondary categories for the products,
which have 41 different levels. Many of these categories are similar,
so we can collapse them as well to make the data more concise. For
example, “Tools” can account for products in “Tools”, “High Tech
Tools”, “Beauty Tools”, and “Hair Tools”. Let’s collapse these and
then combine products with a lower count into “Other”.
# lumping primary categories
products$primary_category <- fct_lump_n(
products$primary_category,
5,
other_level = "Other")
# collapsing secondary category
products$secondary_category <- fct_collapse(products$secondary_category,
"Bath" = c("Bath & Body", "Bath & Shower"),
"Beauty" = c("Beauty Accessories", "Beauty Supplements"),
"Body" = c("Body Care", "Body Moisturizers"),
"Eye" = c("Eye", "Eye Care"),
"Hair" = c("Hair", "Shampoo & Conditioner", "Hair Styling & Treatments"),
"Lip" = c("Lip", "Lip Balms & Treatments"),
"Tools" = c("Tools", "High Tech Tools", "Beauty Tools", "Hair Tools", "Brushes & Applicators"),
"Face" = c("Lip", "Cheek", "Face"),
"Skincare" = c("Skincare", "Sunscreen"))
products$secondary_category <- fct_lump_n(
products$secondary_category, 17,
other_level = "Other")
# checking levels of secondary category after collapse
levels(products$secondary_category)
## [1] "Bath" "Tools" "Body"
## [4] "Candles & Home Scents" "Face" "Cleansers"
## [7] "Eye" "Hair" "Makeup"
## [10] "Masks" "Men" "Mini Size"
## [13] "Moisturizers" "Skincare" "Treatments"
## [16] "Value & Gift Sets" "Women" "Other"
Distribution by Primary Category
Now that we have collapsed similar categories into one, we can
better visualize the relationship between the primary and secondary
categories. First, however, I want to see the distribution of rating
by primary_category using a boxplot.
# creating boxplot
ggplot(products, aes(x = primary_category, y = rating)) +
geom_boxplot(aes(fill = primary_category)) +
scale_fill_manual(values = as.vector(stepped(20))) +
labs(title = "Distribution of Ratings by Primary Category",
x = "Primary Category",
y = "Rating") +
theme_minimal() +
theme(legend.position = "none")
From this box plot, we can see that of the 6 primary categories,
“Fragrance” and “Skincare” have the highest ratings. However, these
values are extremely close to those of the other categories, such as
“Bath & Body”. “Makeup” has the lowest rating, but it’s important to
note that all of these are very close, so primary category may not
play a direct role in how highly a particular product is rated.
Primary and Secondary Categories
Now let’s explore the relationship between the categories themselves.
Often, products are marked with a broad category on the site and then
organized within other categories that are more specific. I decided to
use a stacked bar chart to see how the primary categories are made up
of products belonging to secondary categories. I expect there will be
overlap between common category titles, but I am particularly curious
about categories that are more uncommon within Sephora, such as how
products for men are distributed.
# plotting stacked bar chart of secondary and primary categories
ggplot(products, aes(fill = secondary_category, x = primary_category)) +
geom_bar(position = "fill") +
labs(
title = "Stacked Bar Chart of Primary and Secondary Categories",
x = "Primary", y = "Proportion", fill = "Secondary") +
scale_fill_manual(values=as.vector(stepped(20))) + theme_minimal()
Although we were able to collapse and group many of the secondary
categories, there are still a lot of categories left to analyze! Now
we can better visualize which secondary categories make up each
primary category, and in what proportion. What surprises me is that
“Bath & Body” products are nearly 70% body products, with the
remainder consisting of “Bath”, “Mini Size”, “Value & Gift Sets”, and
“Other”. We can also see that “Fragrance” is majority “Women” rather
than “Men”. What’s also surprising is that even the share of “Candles
& Home Scents” is larger than the share of products for “Men” within
the “Fragrance” primary category. To my surprise, “Men” does not fall
under many of the other primary categories, and neither does “Women”;
this suggests that most categories are marketed toward any audience
regardless of gender. The secondary category “Value & Gift Sets”
appears in every primary category, indicating that all categories of
products have some sort of value or gift set available to purchase.
Since these two categorical variables are related to each other, we
should keep this in mind when considering how our variables can
predict a rating.
Setting Up Models
Now that we have a better understanding of our dataset, we can start
setting up our models to predict product ratings! We need to split our
data into training and testing data, create our recipe, and perform
cross validation.
Splitting Data
First let’s split our data, setting the prop argument to 0.7 so that
70% is used for training. The strata argument is set to our response
variable, rating. I chose 0.7 as the proportion for the split so that
there is a substantial amount of training data to help the model make
accurate predictions. This helps reduce overfitting and improves the
model’s performance on the testing data. It is also important to set a
seed so that we can reproduce our results.
set.seed(3435)
# splitting the data
products_split <- initial_split(products, prop = 0.7, strata = rating)
products_train <- training(products_split)
products_test <- testing(products_split)
# viewing the dimensions of training/testing data
dim(products_train)
## [1] 5744 15
dim(products_test)
## [1] 2464 15
There are 5744 observations in our training data and 2464 in our
testing data. This is a good amount for training and testing on our
products dataset.
Building Recipe
Let’s build our recipe! The recipe represents the model we are
fitting and can be given to the different model ‘engines’ we use
later. We will be using 10 of the 27 original variables as predictors:
loves_count, reviews, limited_edition, new, online_only, out_of_stock,
sephora_exclusive, primary_category, secondary_category, and
price_usd.
We also need to dummy-code the categorical variables and center and
scale the features to normalize them. Earlier, we tidied the data and
dropped the columns with a high share of missing values as well as the
rows with missing observations (as these were related), so there is no
need to impute. In addition, I already collapsed the levels of
primary_category and secondary_category by prioritizing their higher
counts, so we can add these predictors directly to the recipe. I also
decided not to include the predictors brand_name, brand_id,
product_name, and product_id. This is because I want the model to be
able to predict ratings for new products or items from new brands
based on the other variables in our dataset; the model can then
generalize beyond the specific products and brands already offered on
the website. Our completed recipe is below:
# creating recipe
products_recipe <- recipe(rating ~ loves_count + reviews + limited_edition + new + online_only + out_of_stock + sephora_exclusive + primary_category + secondary_category + price_usd, data = products) %>%
step_dummy(all_nominal_predictors()) %>% # dummy-encode categorical variables
step_nzv(all_predictors()) %>% # remove near-zero-variance predictors
step_zv(all_predictors()) %>% # remove zero-variance predictors
step_center(all_predictors()) %>% # center features
step_scale(all_predictors()) # scale features
prep(products_recipe) %>% bake(new_data = products_train) # prep and bake recipe to view
K-Fold Cross Validation
We want to find the hyperparameter values (for example, the number
of neighbors in KNN) that yield the best performance when applied to
new data. This hyperparameter tuning process can be done using k-fold
cross-validation.
This process involves randomly dividing the set of observations into
k folds of approximately equal size. Each fold in turn serves as the
validation set while the model is fit on the remaining folds. In this
case, we are using k = 10 to create 10 folds.
We are stratifying on the response variable, rating.
product_folds <- vfold_cv(products_train, v = 10, strata = rating)
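To make the fold mechanics concrete, here is a minimal base-R illustration of k-fold partitioning. The sizes are made up for the sketch; the vfold_cv() call above handles all of this, plus stratification, for us.

```r
# Shuffle row indices and split them into k roughly equal folds.
# Each fold serves once as the validation set while the model is
# fit on the remaining k - 1 folds.
set.seed(1)
n <- 20   # hypothetical number of training observations
k <- 5    # number of folds
fold_id <- sample(rep(1:k, length.out = n))  # random fold assignment per row
folds <- split(seq_len(n), fold_id)          # list of row indices per fold
sapply(folds, length)                        # each fold holds n / k = 4 rows
```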
Model Building and Process
Now we can finally start building our models. The model building
process consists of the following steps. For this project, I have
decided to fit 6 models: Linear Regression, K-Nearest Neighbors,
Elastic Net Linear Regression, Decision Tree, Random Forest, and
Boosted Trees.
Process
- Setting up Model
The first step is to set up the model we are using. This includes
specifying set_mode, which is regression in this case, as our response
variable rating is continuous. It also includes setting the engine,
which is specific to the type of model being set up. We will also mark
specific parameters to be tuned with tune().
# linear regression model
lm_model <- linear_reg() %>%
set_engine("lm")
# k- nearest neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn")
# elastic net linear model
en_model <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
# decision tree
tree_model <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
# random forest
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# gradient boosted trees
bt_model <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
- Setting Up Model Workflow
In this step, we add each model to a workflow, together with the
products_recipe that we created earlier.
# linear regression
lm_wkflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(products_recipe)
# k- nearest neighbors
knn_wkflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(products_recipe)
# elastic net
en_wkflow <- workflow() %>%
add_recipe(products_recipe) %>%
add_model(en_model)
# decision tree
tree_wkflow <- workflow() %>%
add_model(tree_model) %>%
add_recipe(products_recipe)
# random forest
rf_wkflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(products_recipe)
# gradient boosted trees
bt_wkflow <- workflow() %>%
add_model(bt_model) %>%
add_recipe(products_recipe)
- Setting Up Grids for Tuning Parameters
This is a way to fit many models with different hyperparameters to
see which performs best. We will tune parameters for all models except
linear regression, specifying appropriate ranges for the parameters
that require tuning.
# linear regression: no tuning parameters, no grid
# k- nearest neighbors
knn_grid <- grid_regular(
neighbors(range = c(1,10)),
levels = 5)
# elastic net
en_grid <- grid_regular(
penalty(range = c(-1,0.5)),
mixture(range = c(0, 1)),
levels = 10)
# decision tree
param_grid <- grid_regular(
cost_complexity(range = c(-3, -1)),
levels = 10)
# random forest
rf_grid <- grid_regular(mtry(range = c(1,8)),
trees(range = c(10, 100)),
min_n(range = c(10, 20)),
levels = 5)
# boosted trees
bt_grid <- grid_regular(mtry(range = c(1,8)),
trees(range = c(20, 500)),
learn_rate(range = c(0.1,0.5),
trans = identity_trans()),
levels = 5)
- Tuning Models
To tune the models, we use tune_grid(), specifying each model’s
workflow and the grid for its parameters.
# linear regression: no tuning parameters, no grid
# k- nearest neighbors
tune_knn <- tune_grid(knn_wkflow,
resamples = product_folds,
grid = knn_grid)
# elastic linear net
tune_en <- tune_grid(en_wkflow,
resamples = product_folds,
grid = en_grid)
# decision tree
tune_tree <- tune_grid(tree_wkflow,
resamples = product_folds,
grid = param_grid)
# random forest
tune_rf <- tune_grid(rf_wkflow,
resamples = product_folds,
grid = rf_grid)
# boosted trees
tune_bt <- tune_grid(bt_wkflow,
resamples = product_folds,
grid = bt_grid)
- Saving Tuned Models to File
Now we can save each of the tuned models to a separate .rda file to
avoid rerunning the models.
# saving tuned models
save(tune_knn, file = "data/knn.rda") # K Nearest Neighbors
save(tune_en, file = "data/elastic.rda") # Elastic Net
save(tune_tree, file = "data/decisiontree.rda") # Decision Tree
save(tune_rf, file = "data/rf.rda") # Random Forest
save(tune_bt, file = "data/bt.rda") # Boosted Trees
- Loading Results
Let’s load the saved files.
load("data/knn.rda")
load("data/elastic.rda")
load("data/decisiontree.rda")
load("data/rf.rda")
load("data/bt.rda")
- Collecting Metrics
We can select the best RMSE value for each model to compare, but
first we should fit the linear regression model to the folds, since it
does not require tuning. We will save each of these RMSE values and
compare them in a data frame in the next step.
# fit the linear regression to folds
lm_fit <- fit_resamples(lm_wkflow, resamples = product_folds)
# assigning best rmse value for each model
lm_rmse <- show_best(lm_fit, metric = "rmse")[1,]
knn_rmse <- show_best(tune_knn, metric = "rmse")[1,]
elastic_rmse <- show_best(tune_en, metric = "rmse")[1,]
dt_rmse <- show_best(tune_tree, metric = "rmse") [1,]
rf_rmse <- show_best(tune_rf, metric = "rmse")[1,]
bt_rmse <- show_best(tune_bt, metric = "rmse")[1,]
Model Results
Now we can compare the results of our models and visualize the best
performing ones. First, let’s take a look at the best RMSE of all our
models to determine which performed the best on the folded data.
# collecting results in a data frame
results_rmse <- data.frame(
  Model = c("Linear Regression", "K Nearest Neighbors", "Elastic Net",
            "Decision Tree", "Random Forest", "Boosted Trees"),
  RMSE = c(lm_rmse$mean, knn_rmse$mean, elastic_rmse$mean,
           dt_rmse$mean, rf_rmse$mean, bt_rmse$mean))
results_rmse <- results_rmse %>%
arrange(RMSE) # arranging in ascending order
results_rmse
The top 3 models with the lowest RMSE values are: Random Forest,
Boosted Trees, and Decision Tree. We can visualize these results:
# plotting barplot of results
ggplot(results_rmse, aes(x = Model, y = RMSE)) +
geom_bar(stat = "identity", aes(fill = Model)) +
ggtitle("Barplot of Models and RMSE") +
xlab("Model Type") + ylab("RMSE Value") +
scale_fill_manual(values = as.vector(stepped(24))) + theme_minimal()
Best Model Results
The random forest model achieved the lowest RMSE on our folds, making
it the best fit for predicting Sephora product ratings. Let’s see which
specific parameters were best for this random forest model.
show_best(tune_rf, metric = "rmse")[1,]
And the best model is… Random Forest Model #48! This was the specific
random forest model that fit our data best, with an RMSE value of
0.4786, 50 trees, 4 predictors, and a minimum node size of 12.
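Beyond the single best row, it can be helpful to see how RMSE varied
across the whole tuning grid. A quick sketch using tune’s autoplot()
(assuming the tune_rf object from the tuning step above is loaded):
```r
# visualize RMSE across the random forest tuning grid
# (assumes tune_rf from the tuning step is loaded)
autoplot(tune_rf, metric = "rmse") +
  ggtitle("Random Forest Tuning Results") +
  theme_minimal()
```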
Now that we determined the optimal Random Forest model in terms of
RMSE on our folded data, we can fit it to the training set.
Fitting to Training Data
Let’s fit the best model to the whole training data set.
# select best model
best_rf <- select_best(tune_rf, metric = "rmse")
# fit to training set
products_final <- finalize_workflow(rf_wkflow, best_rf)
products_final <- fit(products_final, data = products_train)
Fitting to Testing Data
While the random forest model performed the best, we want to see how
it handles unfamiliar data. We will fit it to our testing set to assess
its ability to generalize to brand new data.
# fitting to testing data
products_final_test <- augment(products_final, products_test)
rmse(products_final_test, truth = rating, .pred)
Our top model achieved an RMSE of 0.4810755 on the testing set.
Compared to the value of 0.4785896 from the folds, the random forest
model did slightly worse, but not by much: the two RMSE values differ by
only about 0.002. Not bad! The estimate is also acceptable when we
consider the range of our outcome variable, rating, which is from 1 to
5.
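RMSE alone tells only part of the story. As a sketch, yardstick’s
metric_set() can report R-squared and MAE on the same predictions
(assuming the products_final_test object created above):
```r
# a fuller picture of test-set performance
# (assumes products_final_test from the previous step)
multi_metric <- metric_set(rmse, rsq, mae)
multi_metric(products_final_test, truth = rating, estimate = .pred)
```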
Predicted vs. Actual Values
To visualize our model’s performance on the testing data, I created a
scatterplot of actual rating values against the model-predicted values
from the testing set. This will give us a good visual sense of how well
our model performed.
# visualizing scatterplot of actual v predicted values
products_final_test %>%
ggplot(aes(x = .pred, y = rating)) +
geom_point(alpha = 0.3, col = "rosybrown3") +
geom_abline(lty = 1) +
ggtitle("Scatterplot of Actual and Predicted Values") +
theme_minimal()
Based on this plot, some of the points fall directly on the line,
while most are clustered around it. Overall, though the model was not
able to perfectly predict rating, it still did a relatively good job, as
the predicted values follow a similar pattern to the actual values as
they increase.
Variable Importance
Using our best Random Forest model, we can create a variable
importance plot to see which predictors were significant in determining
rating.
# plotting VIP plot
products_final %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "rosybrown3", color = "rosybrown3")) +
ggtitle("Variable Importance Plot") +
theme_minimal()
As we expected, reviews plays a large role in predicting rating! This
makes sense, as products with more reviews tend to have more balanced
ratings, making the number of reviews useful for prediction. There may
be a consistent relationship between reviews and ratings in the training
data that the Random Forest model picked up on. The second highest
variable, loves_count, is also a significant predictor. Similarly to
reviews, a higher number of “loves” for a particular product may be a
good indicator to our model of what a rating may be. Price, the third
highest variable of importance, is also reasonable, as it can influence
consumer expectations when shopping. Consumers may rate products more
highly when an expensive item meets their expectations, whereas if a
product is overpriced, its rating may be lower.
Conclusion
We have determined that the best performing model for predicting a
product’s rating from the Sephora dataset is the random forest model.
Although this model was not perfect in predicting rating, it is
reasonable that it performed well, since it averages the results of
multiple decision trees. In random forests, each split of each tree is
performed on a random subset of predictors, which allows for a
comprehensive exploration of the feature space. This helps the model
avoid overfitting and generalize more flexibly to new data.
The worst performing model was the K Nearest Neighbors (KNN) model,
as it yielded the highest RMSE. This is also reasonable: KNN tends to
perform worse with high dimensionality, where distances between points
become less meaningful and it is difficult to identify distinct nearest
neighbors. KNN models are also sensitive to outliers, and may not be
able to capture the complex relationships in our data given the many
predictors considered.
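The “curse of dimensionality” behind KNN’s weakness can be seen in a
small, self-contained simulation: as the number of dimensions grows, the
nearest and farthest points become nearly equidistant, so “nearest”
neighbors stop being distinctive. A minimal sketch in base R:
```r
# as dimensionality grows, pairwise distances concentrate:
# the min/max distance ratio creeps toward 1
set.seed(1)
for (d in c(2, 10, 100)) {
  x <- matrix(runif(100 * d), ncol = d) # 100 random points in d dimensions
  dists <- as.vector(dist(x))           # all pairwise Euclidean distances
  cat(sprintf("d = %3d: min/max distance ratio = %.2f\n",
              d, min(dists) / max(dists)))
}
```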
One way to improve model performance may be to incorporate and
consider other variables. These can include the number of sales a
particular product has received in the past year, how long the product
has been sold at Sephora, and whether or not the item is considered a
‘bestseller.’ Another predictor variable that would be interesting to
explore could be the ingredients of products. Ingredient conscious
consumers may be more inclined to give certain products a higher rating
than others. Utilizing predictors like these could provide further
insight on consumer shopping decisions with a more specific perspective
on product performance.
In addition, another way to move forward with an analysis of rating
could be to explore the actual text of the reviews written on the site.
Sentiment or text analysis can determine the emotional tone of a review
and identify which products share common tones. This may be helpful for
businesses such as Sephora to determine common issues with certain
products based on the reviews themselves. It would be interesting to see
how the tone of the reviews could also play a role in predicting
rating.
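As a rough illustration of how such a sentiment analysis could start,
here is a minimal sketch using the tidytext package and the Bing
sentiment lexicon (fetching the lexicon may require the textdata
package). The review_text column and the toy data are hypothetical,
since review text was not among the predictors used in this project:
```r
# hypothetical sketch: tallying positive/negative words per product
# (toy data; `review_text` is not a column in this project's dataset)
library(dplyr)
library(tidytext)

reviews <- tibble(
  product_id = c("A", "A", "B"),
  review_text = c("love this product, amazing results",
                  "pricey but it works well",
                  "terrible smell, very disappointed"))

reviews %>%
  unnest_tokens(word, review_text) %>%                 # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep lexicon words
  count(product_id, sentiment)  # sentiment word counts per product
```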
Ultimately, the process of exploring the dataset and evaluating model
performance has furthered my understanding of predictive methods and the
interpretation of results. This project has not only enhanced my
knowledge of machine learning techniques; it has also broadened my
perspective on how such insights can have significant implications in
business contexts! I’ve also found myself more interested in
understanding why a certain model may perform better than others and how
the features specific to my dataset contribute to this. Although the
random forest model did not perfectly predict the rating of a Sephora
product, this project has been a valuable opportunity to apply machine
learning methods to real world data!