A DVD rental company wants to figure out how many days a customer will rent a DVD based on a set of features and has approached you for help. They want you to try out some regression models that predict the number of days a customer will keep a DVD.
The company wants a model that yields an MSE of 3 or less on a test set; such a model will help the company plan its inventory more efficiently.
The data they provided is in the CSV file rental_info.csv. It has the following features:
- rental_date: The date (and time) the customer rents the DVD.
- return_date: The date (and time) the customer returns the DVD.
- amount: The amount paid by the customer for renting the DVD.
- rental_rate: The rate at which the DVD is rented.
- release_year: The year the movie being rented was released.
- length: Length of the movie being rented, in minutes.
- replacement_cost: The amount it will cost the company to replace the DVD.
- special_features: Any special features the DVD also has, for example trailers or deleted scenes.
- NC-17, PG, PG-13, R: Dummy variables for the movie's rating; each takes the value 1 if the movie has that rating and 0 otherwise. For your convenience, the reference dummy has already been dropped.
1 Used Libraries
library(dplyr) # Used for Data Manipulation
library(tidyr)
library(readr) # Used for Reading Data
library(kableExtra) # Used for displaying tables in a more readable form
library(modelsummary) # Used for displaying summary tables
library(tinytable) # Used for highlighting important cells in a table
library(plotly) # Used for interactive plots
library(rsample) # Used for splitting the dataset
library(caret) # Used for training models
library(rpart) # Used for the Decision Tree model
library(glmnet) # Used for Lasso Regression
library(e1071) # Used to train the SVM model
2 Objectives
- Import and inspect the data and apply necessary pre-processing transformations.
- Test different regression models and assess their performance on a hold-out dataset by measuring their Mean Squared Error (MSE), which must not exceed a value of 2.85 (a sketch of the computation follows this list).
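For reference, the hold-out MSE used throughout is simply the mean squared difference between predictions and observed rental durations; a minimal sketch, where "predictions" and "actuals" are placeholder names:
# Mean Squared Error on the hold-out set ("predictions" and "actuals" are placeholders)
mse <- mean((predictions - actuals)^2)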
3 Cleaning Data
3.1 Initial inspection
# Reading data
data <- read_csv("rental_info.csv")
Rows: 15861 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): special_features
dbl (9): amount, release_year, rental_rate, length, replacement_cost, NC-17...
dttm (2): rental_date, return_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rental_date | return_date | amount | release_year | rental_rate | length | replacement_cost | special_features | NC-17 | PG | PG-13 | R |
---|---|---|---|---|---|---|---|---|---|---|---|
2005-05-25 02:54:33 | 2005-05-28 23:40:33 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-06-15 23:19:16 | 2005-06-18 19:24:16 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-07-10 04:27:45 | 2005-07-17 10:11:45 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-07-31 12:06:41 | 2005-08-02 14:30:41 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-08-19 12:30:04 | 2005-08-23 13:35:04 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-05-29 16:51:44 | 2005-06-01 21:43:44 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
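The preview above and the missing-value counts below come from chunks that aren't echoed in this version; something along these lines (a sketch) reproduces them:
# Preview the first rows as a formatted table
head(data) |> kbl()
# Count missing values per column (all counts below are zero)
colSums(is.na(data))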
rental_date return_date amount release_year
0 0 0 0
rental_rate length replacement_cost special_features
0 0 0 0
NC-17 PG PG-13 R
0 0 0 0
3.2 Extracting rental days
# Extract the rental duration in days; explicit units avoid relying on
# difftime()'s automatic unit selection
data <- data |>
  mutate(rental_days = as.numeric(difftime(return_date, rental_date, units = "days"))) |>
  select(rental_days, everything(), -c(rental_date, return_date))
head(data) |> kbl()
rental_days | amount | release_year | rental_rate | length | replacement_cost | special_features | NC-17 | PG | PG-13 | R |
---|---|---|---|---|---|---|---|---|---|---|
3.865278 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2.836806 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
7.238889 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2.100000 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
4.045139 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
3.202778 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
3.3 Extracting special features
We notice that the special_features column is a character string holding a set of features, so we have to separate each feature into its own column in order to use it for modelling later on.
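The table of distinct values below was presumably produced along these lines (the original chunk isn't shown, so treat this as a sketch):
# List every distinct value of special_features
unique(data$special_features) |> kbl()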
x |
---|
{Trailers,"Behind the Scenes"} |
{Trailers} |
{Commentaries,"Behind the Scenes"} |
{Trailers,Commentaries} |
{"Deleted Scenes","Behind the Scenes"} |
{Commentaries,"Deleted Scenes","Behind the Scenes"} |
{Trailers,Commentaries,"Deleted Scenes"} |
{"Behind the Scenes"} |
{Trailers,"Deleted Scenes","Behind the Scenes"} |
{Commentaries,"Deleted Scenes"} |
{Commentaries} |
{Trailers,Commentaries,"Behind the Scenes"} |
{Trailers,"Deleted Scenes"} |
{"Deleted Scenes"} |
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"} |
# Creating a column for each one of them
data <- data |>
mutate(trailers = ifelse(grepl("Trailers", special_features), 1, 0),
behind_the_scenes = ifelse(grepl("Behind the Scenes", special_features), 1, 0),
commentaries = ifelse(grepl("Commentaries", special_features), 1, 0),
deleted_scenes = ifelse(grepl("Deleted Scenes", special_features), 1, 0))
# Verifying the work
data |>
select(special_features, trailers, behind_the_scenes, commentaries, deleted_scenes) |>
sample_n(50) |>
head(10) |>
kbl()
special_features | trailers | behind_the_scenes | commentaries | deleted_scenes |
---|---|---|---|---|
{"Deleted Scenes"} | 0 | 0 | 0 | 1 |
{Trailers,Commentaries} | 1 | 0 | 1 | 0 |
{Trailers,"Behind the Scenes"} | 1 | 1 | 0 | 0 |
{Commentaries} | 0 | 0 | 1 | 0 |
{Trailers,"Deleted Scenes"} | 1 | 0 | 0 | 1 |
{"Behind the Scenes"} | 0 | 1 | 0 | 0 |
{Trailers,Commentaries,"Deleted Scenes"} | 1 | 0 | 1 | 1 |
{Trailers,"Deleted Scenes"} | 1 | 0 | 0 | 1 |
{Commentaries} | 0 | 0 | 1 | 0 |
{Trailers,Commentaries,"Deleted Scenes"} | 1 | 0 | 1 | 1 |
# Getting rid of the `special_features` column
data <- data |>
select(-special_features)
4 EDA
4.1 Visualizations
4.1.1 Distribution of movie special features
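The interactive plotly charts in these subsections don't render in this text version. As an illustration, the distribution of special features could be plotted roughly as follows (a sketch; the original plotting code isn't shown):
# Count how many rentals include each special feature and plot the counts
data |>
  summarise(across(c(trailers, behind_the_scenes, commentaries, deleted_scenes), sum)) |>
  pivot_longer(everything(), names_to = "feature", values_to = "count") |>
  plot_ly(x = ~feature, y = ~count, type = "bar")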
4.1.2 Distribution of movie ratings
4.1.3 Distribution of movies per release year
4.1.4 Conclusion
The dataset is fairly balanced across the special features, release year and rating columns, which ensures that the models we'll build won't be driven by the dominance of one value over the others.
As for the rest of the columns, there are no other categorical variables that could influence our model.
4.2 Additional insights on data
4.2.1 Summary of the remaining columns
Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
rental_days | 5892 | 0 | 5.0 | 2.6 | 0.8 | 5.0 | 9.2 |
amount | 12 | 0 | 4.2 | 2.4 | 1.0 | 4.0 | 12.0 |
rental_rate | 3 | 0 | 2.9 | 1.6 | 1.0 | 3.0 | 5.0 |
length | 140 | 0 | 115.0 | 40.1 | 46.0 | 114.0 | 185.0 |
replacement_cost | 21 | 0 | 20.2 | 6.1 | 10.0 | 21.0 | 30.0 |
We notice that the replacement cost of a DVD is high, sitting between 10 and 30 dollars, which makes it essential for the company to maximize its rental profits: rental rates are low compared to the replacement cost, compensated only by the fairly high average rental duration of about 5 days. The mean amount paid is $4.20, although the median of $4.00 indicates that lower amounts are somewhat more frequent.
4.2.2 Studying the correlation between variables
 | rental_days | amount | release_year | rental_rate | length | replacement_cost | NC-17 | PG | PG-13 | R | trailers | behind_the_scenes | commentaries | deleted_scenes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rental_days | 1 | . | . | . | . | . | . | . | . | . | . | . | . | . |
amount | .54 | 1 | . | . | . | . | . | . | . | . | . | . | . | . |
release_year | .01 | .02 | 1 | . | . | . | . | . | . | . | . | . | . | . |
rental_rate | .00 | .71 | .04 | 1 | . | . | . | . | . | . | . | . | . | . |
length | .00 | .02 | .03 | .06 | 1 | . | . | . | . | . | . | . | . | . |
replacement_cost | .02 | -.03 | .07 | -.07 | .03 | 1 | . | . | . | . | . | . | . | . |
NC-17 | .00 | .01 | .03 | .04 | -.03 | .00 | 1 | . | . | . | . | . | . | . |
PG | -.01 | -.01 | -.02 | .00 | -.05 | -.08 | -.25 | 1 | . | . | . | . | . | . |
PG-13 | .01 | .01 | .03 | .02 | .06 | .04 | -.27 | -.27 | 1 | . | . | . | . | . |
R | -.01 | -.01 | -.05 | -.03 | .07 | .02 | -.25 | -.25 | -.27 | 1 | . | . | . | . |
trailers | .00 | -.03 | -.04 | -.06 | -.03 | -.02 | -.01 | .01 | .00 | .00 | 1 | . | . | . |
behind_the_scenes | .00 | -.02 | .00 | .00 | .01 | .01 | .03 | -.02 | .00 | .00 | -.08 | 1 | . | . |
commentaries | .00 | .02 | -.04 | .03 | .01 | -.02 | .08 | -.01 | -.01 | -.06 | -.06 | -.04 | 1 | . |
deleted_scenes | .00 | -.01 | .01 | -.05 | .00 | .05 | .02 | .06 | -.03 | -.04 | -.12 | -.09 | -.07 | 1 |
Because the data are not normally distributed, we use the Spearman correlation method. We notice a reasonably strong positive correlation between amount and rental_rate, and a moderate one between amount and rental_days, which makes sense since the amount paid is largely determined by the rental rate and the rental duration.
We also spot weak negative correlations among the rating dummies, since the ratings are mutually exclusive: a movie rated R certainly won't also be rated PG (for kids).
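The chunk that produced the matrix above isn't shown; a sketch with base R (datasummary_correlation from modelsummary would be another option) could be:
# Spearman correlation matrix of all numeric columns, rounded to two decimals
data |>
  cor(method = "spearman") |>
  round(2)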
5 Model Building & Deployment
5.1 Preparing Data
5.1.1 Splitting the data
set.seed(6) # for reproducibility
# Split the data into 75% training and 25% test sets
split <- data |>
  initial_split(prop = 0.75)
train <- training(split)
test <- testing(split)
5.1.2 Preparing important variables
# Keep the target variable separately and drop it from the feature sets
rental_days_train <- train$rental_days
rental_days_test <- test$rental_days
train$rental_days <- NULL
test$rental_days <- NULL
# The response is taken from the calling environment; the predictors (".")
# come from whichever data frame is passed to each model
formula <- rental_days_train ~ .
5.1.3 Data Preprocessing
To improve model performance, a common best practice is centering, where we subtract each variable's mean from its values, and scaling, where each variable is divided by its standard deviation.
# Estimate the centering/scaling parameters on the training set only,
# then apply the same transformation to both sets to avoid data leakage
preProc <- preProcess(train, method = c("center", "scale"))
train <- predict(preProc, train)
test <- predict(preProc, test)
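As a quick sanity check (not part of the original document), the transformed training columns should now have means close to 0 and standard deviations close to 1:
# Means ~0 and standard deviations ~1 confirm centering and scaling worked
round(colMeans(train), 3)
round(apply(train, 2, sd), 3)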
5.2 Training Models
We'll start with a linear regression model, followed by a more advanced one, Lasso (least absolute shrinkage and selection operator) regression, which performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. We'll then fit a simple decision tree and an SVM model.
5.2.1 Linear Regression
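The code for this model isn't echoed in this version; a minimal sketch consistent with the variables defined above (and with the lm_mse reported in the comparison table) would be:
# Fit an ordinary least squares model; the response is taken from the environment
lm_model <- lm(formula, data = train)
# Predict on the hold-out set and compute the MSE
pred <- predict(lm_model, newdata = test)
lm_mse <- mean((rental_days_test - pred)^2)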
5.2.2 Lasso Regression
# Performing k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(as.matrix(train), rental_days_train, alpha = 1)
# Optimal lambda value that minimizes MSE
best_lambda <- cv_model$lambda.min
# Training
lasso_model <- train(x = train, y = rental_days_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = 1,
lambda = best_lambda))
# Testing
pred <- predict(lasso_model, test)
# Calculating the RMSE on the test set, then squaring it to get the MSE
lasso_rmse <- RMSE(rental_days_test, pred)
lasso_mse <- lasso_rmse^2
The plot indicates that the most important variables are amount and rental_rate, with the special features and ratings being non-influential, meaning that we could have removed them from the model-building step without affecting overall performance.
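The plot (presumably a variable-importance plot) doesn't render here; with caret it could be reproduced with something like the following (an assumption about the original chunk):
# Plot variable importance derived from the fitted glmnet model
plot(varImp(lasso_model))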
5.2.3 Decision Tree
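The chunk isn't echoed here either; a sketch using rpart with default settings (the actual tuning, if any, isn't shown) might look like:
# Fit a regression tree on the training features
tree_model <- rpart(formula, data = train)
# Predict on the hold-out set and compute the MSE
pred <- predict(tree_model, newdata = test)
tree_mse <- mean((rental_days_test - pred)^2)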
5.2.4 SVM
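Similarly, a sketch of the SVM step with e1071, which defaults to a radial kernel and eps-regression for a numeric response (the actual hyperparameters used aren't shown):
# Fit a support vector regression model with default hyperparameters
svm_model <- svm(formula, data = train)
# Predict on the hold-out set and compute the MSE
pred <- predict(svm_model, newdata = test)
svm_mse <- mean((rental_days_test - pred)^2)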
5.3 Comparing models
lm_mse | lasso_mse | tree_mse | svm_mse |
---|---|---|---|
2.764812 | 2.764234 | 2.339061 | 2.081244 |
All the models scored better than the target MSE of 2.85, with the SVM model performing best, although it is slow to train (~11 s); the decision tree is the fastest and simplest of the four.