A DVD rental company wants to figure out how many days a customer will rent a DVD based on a set of features and has approached you for help. They want you to try out some regression models that predict the number of days a customer will keep a DVD.
The company wants a model that yields an MSE of 3 or less on a test set; such a model will help the company plan its inventory more efficiently.
The data they provided is in the CSV file rental_info.csv. It has the following features:
- rental_date: The date (and time) the customer rents the DVD.
- return_date: The date (and time) the customer returns the DVD.
- amount: The amount paid by the customer for renting the DVD.
- rental_rate: The rate at which the DVD is rented.
- release_year: The year the movie being rented was released.
- length: Length of the movie being rented, in minutes.
- replacement_cost: The amount it will cost the company to replace the DVD.
- special_features: Any special features the DVD also has, for example trailers or deleted scenes.
- NC-17, PG, PG-13, R: Dummy variables for the movie's rating; each takes the value 1 if the movie has that rating and 0 otherwise. For your convenience, the reference dummy has already been dropped.
1 Used Libraries
library(dplyr) # Used for Data Manipulation
library(tidyr)
library(readr) # Used for Reading Data
library(kableExtra) # Used for displaying tables in a more readable form
library(modelsummary) # Used for displaying summary tables
library(tinytable) # Used for highlighting important cells in a table
library(plotly) # Used for interactive plots
library(rsample) # Used for splitting the dataset
library(caret) # Used for training models
library(rpart) # Used for the Decision Tree model
library(glmnet) # Used for Lasso Regression
library(e1071) # Used to train the SVM model
2 Objectives
- Import and inspect the data and apply necessary pre-processing transformations.
- Test different regression models and assess their performance on a hold-out dataset by measuring their Mean Squared Error (MSE), which must not exceed a value of 2.85 (a sketch of the computation follows this list).
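For reference, the hold-out MSE used throughout is simply the mean squared difference between predictions and observed rental durations; a minimal sketch, where "predictions" and "actuals" are placeholder names:
# Mean Squared Error on the hold-out set ("predictions" and "actuals" are placeholders)
mse <- mean((predictions - actuals)^2)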
3 Cleaning Data
3.1 Initial inspection
# Reading data
data <- read_csv("rental_info.csv")
Rows: 15861 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): special_features
dbl (9): amount, release_year, rental_rate, length, replacement_cost, NC-17...
dttm (2): rental_date, return_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rental_date | return_date | amount | release_year | rental_rate | length | replacement_cost | special_features | NC-17 | PG | PG-13 | R |
---|---|---|---|---|---|---|---|---|---|---|---|
2005-05-25 02:54:33 | 2005-05-28 23:40:33 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-06-15 23:19:16 | 2005-06-18 19:24:16 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-07-10 04:27:45 | 2005-07-17 10:11:45 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-07-31 12:06:41 | 2005-08-02 14:30:41 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-08-19 12:30:04 | 2005-08-23 13:35:04 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2005-05-29 16:51:44 | 2005-06-01 21:43:44 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
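The preview above and the missing-value counts below come from chunks that aren't echoed in this version; something along these lines (a sketch) reproduces them:
# Preview the first rows as a formatted table
head(data) |> kbl()
# Count missing values per column (all counts below are zero)
colSums(is.na(data))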
rental_date return_date amount release_year
0 0 0 0
rental_rate length replacement_cost special_features
0 0 0 0
NC-17 PG PG-13 R
0 0 0 0
3.2 Extracting rental days
# Extract the rental duration in days; explicit units avoid relying on
# difftime()'s automatic unit selection
data <- data |>
  mutate(rental_days = as.numeric(difftime(return_date, rental_date, units = "days"))) |>
  select(rental_days, everything(), -c(rental_date, return_date))
head(data) |> kbl()
rental_days | amount | release_year | rental_rate | length | replacement_cost | special_features | NC-17 | PG | PG-13 | R |
---|---|---|---|---|---|---|---|---|---|---|
3.865278 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2.836806 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
7.238889 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
2.100000 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
4.045139 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
3.202778 | 2.99 | 2005 | 2.99 | 126 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 |
3.3 Extracting special features
We notice that the special_features column is a character string holding a set of features, so we have to separate each feature into its own column in order to use it for modelling later on.
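The table of distinct values below was presumably produced along these lines (the original chunk isn't shown, so treat this as a sketch):
# List every distinct value of special_features
unique(data$special_features) |> kbl()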
x |
---|
{Trailers,"Behind the Scenes"} |
{Trailers} |
{Commentaries,"Behind the Scenes"} |
{Trailers,Commentaries} |
{"Deleted Scenes","Behind the Scenes"} |
{Commentaries,"Deleted Scenes","Behind the Scenes"} |
{Trailers,Commentaries,"Deleted Scenes"} |
{"Behind the Scenes"} |
{Trailers,"Deleted Scenes","Behind the Scenes"} |
{Commentaries,"Deleted Scenes"} |
{Commentaries} |
{Trailers,Commentaries,"Behind the Scenes"} |
{Trailers,"Deleted Scenes"} |
{"Deleted Scenes"} |
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"} |
# Creating a column for each one of them
data <- data |>
mutate(trailers = ifelse(grepl("Trailers", special_features), 1, 0),
behind_the_scenes = ifelse(grepl("Behind the Scenes", special_features), 1, 0),
commentaries = ifelse(grepl("Commentaries", special_features), 1, 0),
deleted_scenes = ifelse(grepl("Deleted Scenes", special_features), 1, 0))
# Verifying the work
data |>
select(special_features, trailers, behind_the_scenes, commentaries, deleted_scenes) |>
sample_n(50) |>
head(10) |>
kbl()
special_features | trailers | behind_the_scenes | commentaries | deleted_scenes |
---|---|---|---|---|
{"Deleted Scenes"} | 0 | 0 | 0 | 1 |
{Trailers,Commentaries} | 1 | 0 | 1 | 0 |
{Trailers,"Behind the Scenes"} | 1 | 1 | 0 | 0 |
{Commentaries} | 0 | 0 | 1 | 0 |
{Trailers,"Deleted Scenes"} | 1 | 0 | 0 | 1 |
{"Behind the Scenes"} | 0 | 1 | 0 | 0 |
{Trailers,Commentaries,"Deleted Scenes"} | 1 | 0 | 1 | 1 |
{Trailers,"Deleted Scenes"} | 1 | 0 | 0 | 1 |
{Commentaries} | 0 | 0 | 1 | 0 |
{Trailers,Commentaries,"Deleted Scenes"} | 1 | 0 | 1 | 1 |
# Getting rid of the `special_features` column
data <- data |>
select(-special_features)
4 EDA
4.1 Visualizations
4.1.1 Distribution of movie special features
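The interactive plotly charts in these subsections don't render in this text version. As an illustration, the distribution of special features could be plotted roughly as follows (a sketch; the original plotting code isn't shown):
# Count how many rentals include each special feature and plot the counts
data |>
  summarise(across(c(trailers, behind_the_scenes, commentaries, deleted_scenes), sum)) |>
  pivot_longer(everything(), names_to = "feature", values_to = "count") |>
  plot_ly(x = ~feature, y = ~count, type = "bar")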
4.1.2 Distribution of movie ratings
4.1.3 Distribution of movies per release year
4.1.4 Conclusion
The dataset is fairly balanced across the special features, release year and rating columns, which ensures that the models we'll build won't be driven by the dominance of one value over the others.
As for the rest of the columns, there are no other categorical variables that could influence our model.
4.2 Additional insights on data
4.2.1 Summary of the remaining columns
Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
rental_days | 5892 | 0 | 5.0 | 2.6 | 0.8 | 5.0 | 9.2 |
amount | 12 | 0 | 4.2 | 2.4 | 1.0 | 4.0 | 12.0 |
rental_rate | 3 | 0 | 2.9 | 1.6 | 1.0 | 3.0 | 5.0 |
length | 140 | 0 | 115.0 | 40.1 | 46.0 | 114.0 | 185.0 |
replacement_cost | 21 | 0 | 20.2 | 6.1 | 10.0 | 21.0 | 30.0 |
We notice that the replacement cost of a DVD is high, sitting between 10 and 30 dollars, which makes it essential for the company to maximize its rental profits: rental rates are low compared to the replacement cost, compensated only by the fairly high average rental duration of about 5 days. The mean amount paid is $4.20, although the median of $4.00 indicates that lower amounts are somewhat more frequent.
4.2.2 Studying the correlation between variables
 | rental_days | amount | release_year | rental_rate | length | replacement_cost | NC-17 | PG | PG-13 | R | trailers | behind_the_scenes | commentaries | deleted_scenes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rental_days | 1 | . | . | . | . | . | . | . | . | . | . | . | . | . |
amount | .54 | 1 | . | . | . | . | . | . | . | . | . | . | . | . |
release_year | .01 | .02 | 1 | . | . | . | . | . | . | . | . | . | . | . |
rental_rate | .00 | .71 | .04 | 1 | . | . | . | . | . | . | . | . | . | . |
length | .00 | .02 | .03 | .06 | 1 | . | . | . | . | . | . | . | . | . |
replacement_cost | .02 | -.03 | .07 | -.07 | .03 | 1 | . | . | . | . | . | . | . | . |
NC-17 | .00 | .01 | .03 | .04 | -.03 | .00 | 1 | . | . | . | . | . | . | . |
PG | -.01 | -.01 | -.02 | .00 | -.05 | -.08 | -.25 | 1 | . | . | . | . | . | . |
PG-13 | .01 | .01 | .03 | .02 | .06 | .04 | -.27 | -.27 | 1 | . | . | . | . | . |
R | -.01 | -.01 | -.05 | -.03 | .07 | .02 | -.25 | -.25 | -.27 | 1 | . | . | . | . |
trailers | .00 | -.03 | -.04 | -.06 | -.03 | -.02 | -.01 | .01 | .00 | .00 | 1 | . | . | . |
behind_the_scenes | .00 | -.02 | .00 | .00 | .01 | .01 | .03 | -.02 | .00 | .00 | -.08 | 1 | . | . |
commentaries | .00 | .02 | -.04 | .03 | .01 | -.02 | .08 | -.01 | -.01 | -.06 | -.06 | -.04 | 1 | . |
deleted_scenes | .00 | -.01 | .01 | -.05 | .00 | .05 | .02 | .06 | -.03 | -.04 | -.12 | -.09 | -.07 | 1 |
Because the data are not normally distributed, we use the Spearman correlation method. We notice a reasonably strong positive correlation between amount and rental_rate, and a moderate one between amount and rental_days, which makes sense since the amount paid is largely determined by the rental rate and the rental duration.
We also spot weak negative correlations among the rating dummies, since the ratings are mutually exclusive: a movie rated R certainly won't also be rated PG (for kids).
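The chunk that produced the matrix above isn't shown; a sketch with base R (datasummary_correlation from modelsummary would be another option) could be:
# Spearman correlation matrix of all numeric columns, rounded to two decimals
data |>
  cor(method = "spearman") |>
  round(2)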
5 Model Building & Deployment
5.1 Preparing Data
5.1.1 Splitting the data
set.seed(6) # for reproducibility
# Split the data into 75% training and 25% test sets
split <- data |>
  initial_split(prop = 0.75)
train <- training(split)
test <- testing(split)
5.1.2 Preparing important variables
# Keep the target variable separately and drop it from the feature sets
rental_days_train <- train$rental_days
rental_days_test <- test$rental_days
train$rental_days <- NULL
test$rental_days <- NULL
# The response is taken from the calling environment; the predictors (".")
# come from whichever data frame is passed to each model
formula <- rental_days_train ~ .
5.1.3 Data Preprocessing
To improve model performance, a common best practice is centering, where we subtract each variable's mean from its values, and scaling, where each variable is divided by its standard deviation.
# Estimate the centering/scaling parameters on the training set only,
# then apply the same transformation to both sets to avoid data leakage
preProc <- preProcess(train, method = c("center", "scale"))
train <- predict(preProc, train)
test <- predict(preProc, test)
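As a quick sanity check (not part of the original document), the transformed training columns should now have means close to 0 and standard deviations close to 1:
# Means ~0 and standard deviations ~1 confirm centering and scaling worked
round(colMeans(train), 3)
round(apply(train, 2, sd), 3)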
5.2 Training Models
We'll start with a linear regression model, followed by a more advanced one, Lasso (least absolute shrinkage and selection operator) regression, which performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. We'll then fit a simple decision tree and an SVM model.
5.2.1 Linear Regression
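The code for this model isn't echoed in this version; a minimal sketch consistent with the variables defined above (and with the lm_mse reported in the comparison table) would be:
# Fit an ordinary least squares model; the response is taken from the environment
lm_model <- lm(formula, data = train)
# Predict on the hold-out set and compute the MSE
pred <- predict(lm_model, newdata = test)
lm_mse <- mean((rental_days_test - pred)^2)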
5.2.2 Lasso Regression
# Performing k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(as.matrix(train), rental_days_train, alpha = 1)
# Optimal lambda value that minimizes MSE
best_lambda <- cv_model$lambda.min
# Training
lasso_model <- train(x = train, y = rental_days_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = 1,
lambda = best_lambda))
# Testing
pred <- predict(lasso_model, test)
# Calculating the RMSE on the test set, then squaring it to get the MSE
lasso_rmse <- RMSE(rental_days_test, pred)
lasso_mse <- lasso_rmse^2
The plot indicates that the most important variables are amount and rental_rate, with the special features and ratings being non-influential, meaning that we could have removed them from the model-building step without affecting overall performance.
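The plot (presumably a variable-importance plot) doesn't render here; with caret it could be reproduced with something like the following (an assumption about the original chunk):
# Plot variable importance derived from the fitted glmnet model
plot(varImp(lasso_model))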
5.2.3 Decision Tree
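The chunk isn't echoed here either; a sketch using rpart with default settings (the actual tuning, if any, isn't shown) might look like:
# Fit a regression tree on the training features
tree_model <- rpart(formula, data = train)
# Predict on the hold-out set and compute the MSE
pred <- predict(tree_model, newdata = test)
tree_mse <- mean((rental_days_test - pred)^2)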
5.2.4 SVM
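Similarly, a sketch of the SVM step with e1071, which defaults to a radial kernel and eps-regression for a numeric response (the actual hyperparameters used aren't shown):
# Fit a support vector regression model with default hyperparameters
svm_model <- svm(formula, data = train)
# Predict on the hold-out set and compute the MSE
pred <- predict(svm_model, newdata = test)
svm_mse <- mean((rental_days_test - pred)^2)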
5.3 Comparing models
lm_mse | lasso_mse | tree_mse | svm_mse |
---|---|---|---|
2.764812 | 2.764234 | 2.339061 | 2.081244 |
All the models scored better than the target MSE of 2.85, with the SVM model performing best, although it is slow to train (~11 s); the decision tree is the fastest and simplest of the four.