import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
from pathlib import Path
Introduction
After beating the giant Blockbuster in the 2000s, Netflix established itself as the top provider of movies; firstly by shipping DVDs and then by becoming one of the first adopters of streaming content online that paved the way into its success today.
Being part of our digital life, we will be analyzing its dataset provided by Datacamp.
Data Description
netflix_data.csv
Column | Description |
---|---|
show_id |
The ID of the show |
type |
Type of show |
title |
Title of the show |
director |
Director of the show |
cast |
Cast of the show |
country |
Country of origin |
date_added |
Date added to Netflix |
release_year |
Year of Netflix release |
duration |
Duration of the show in minutes |
description |
Description of the show |
genre |
Show genre |
Used Libraries
Data Cleaning
To ensure data quality, we have to inspect the dataset first and clean its records in order to obtain viable data.
= pd.read_csv('netflix_data.csv')
data data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4812 entries, 0 to 4811
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 4812 non-null object
1 type 4812 non-null object
2 title 4812 non-null object
3 director 4812 non-null object
4 cast 4812 non-null object
5 country 4812 non-null object
6 date_added 4812 non-null object
7 release_year 4812 non-null int64
8 duration 4812 non-null int64
9 description 4812 non-null object
10 genre 4812 non-null object
dtypes: int64(2), object(9)
memory usage: 413.7+ KB
= data.copy() df
Upon trying to convert the column date_added
, we notice that there is an error in the 161st row where the date has a space at the beginning of the string.
pd.to_datetime(df['date_added']) time data " August 4, 2017" doesn't match format "%B %d, %Y", at position 161.
'date_added'].str.startswith(' ')][['title', 'date_added']] df[df[
title date_added
203 Abnormal Summit August 4, 2017
1339 Father Brown March 31, 2018
2477 Mars November 1, 2019
3292 Royal Pains May 18, 2017
#4 records that can cause problems as they start with a space
'date_added'] = df['date_added'].str.lstrip()
df[#Verifying
'date_added'].str.startswith(' ')] df[df[
Empty DataFrame
Columns: [show_id, type, title, director, cast, country, date_added, release_year, duration, description, genre]
Index: []
#It now works fine
'date_added']) pd.to_datetime(df[
0 2016-12-23
1 2018-12-20
2 2017-11-16
3 2020-01-01
4 2017-07-01
...
4807 2019-11-01
4808 2018-07-01
4809 2020-01-11
4810 2020-10-19
4811 2019-03-02
Name: date_added, Length: 4812, dtype: datetime64[ns]
After cleaning the column and turning to the duration
one, we spot a bizarre duration of potentially 1 minute for some shows which could cause problems.
After inspecting the data, it seems that we have two types of shows; Movies and TV Shows where the latter has the same odd duration
records due to shows typically being measured with the number of seasons/episodes.
'title', 'duration']].sort_values('duration').head() df[[
title duration
1935 Jack Taylor 1
2411 Maggie & Bianca: Fashion Friends 1
1879 Inhuman Resources 1
2956 Paava Kadhaigal 1
2942 Ouran High School Host Club 1
#it appears that TV shows have incorrect durations which is logical due to them having many episodes
'type'] == 'TV Show']['duration'].describe() df[df[
count 135.000000
mean 1.940741
std 2.118726
min 1.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 15.000000
Name: duration, dtype: float64
#To insure that all the data for tv shows is incorrect we will check the TV show with the longest duration
'type'] == 'TV Show']['duration'].idxmax()] df.iloc[df[df[
show_id s5913
type TV Show
title Supernatural
director Phil Sgriccia
cast Jared Padalecki, Jensen Ackles, Mark Sheppard,...
country United States
date_added June 5, 2020
release_year 2019
duration 15
description Siblings Dean and Sam crisscross the country, ...
genre Classic
Name: 3678, dtype: object
In order to make sure that our assumption is correct, we will explore communicating with an LLM model in two ways to verify the duration
of the TV show with the longest duration: Supernatural.
Code for communicating with Qwen2.5 using HuggingFace’s API to get verify the duartion.
from huggingface_hub import InferenceClient
file = open('api.txt', 'r')
= file.read()
api file.close()
= InferenceClient(api_key=api)
client
= [
messages
{"role": "user",
"content": "What is the duration of the TV Show Supernatural?"
}
]
= client.chat.completions.create(
stream ="Qwen/Qwen2.5-Coder-32B-Instruct",
model=messages,
messages=500,
max_tokens=True
stream
)
for chunk in stream:
print(chunk.choices[0].delta.content, end="")
The TV show “Supernatural” aired for 15 seasons, from 2005 to 2020. Each season typically consists of around 22 episodes, with some seasons having slightly more or fewer episodes. The duration of the show in terms of number of episodes is 310 episodes in total. Each episode is usually about 42 to 45 minutes long, including commercials.
Alternativly, we can use the R package {ollamar} to communicate with a locally installed LLM model in order to get answers.
set.seed(21)
generate("llama3.2",
"What is the duration of the TV Show Supernatural? Answer it shortly and concisely.",
output = "text")
[1] “The TV show Supernatural aired for 15 seasons from 2005 to 2020, with a total of 327 episodes.”
According to the answers provided by the two models, the duration
column for TV shows is mainly for the number of seasons available on Netflix which causes us to separate movies and TV shows into two different dataframes.
= df[df['type'] == 'TV Show']
tv_shows = df[df['type'] != 'TV Show'] movies
Data Extraction
For this part of the report, we will be transforming data which we will use to create visualizations and highlight interesting insights from the dataset.
Extracting the Top 10 Most Productive Countries
= df.groupby('country').size().sort_values(ascending = False).head(10) top_10_countries
Extracting the Top 10 Directors Info
An interesting analysis would be to analyze the most productive directors from their favorite actor to their most produced genre.
= df.assign(actors=df['cast'].str.split(', ')).explode('cast')
df_expanded
= df_expanded.groupby('director').size().sort_values(ascending=False).head(10)
top_10_directors
#A function to get the most produced genre for each director
def top_genre(x):
return x.mode().iloc[0] if not x.empty else None
#Function to get most casted actor by each director
def most_common_actor(actors_column):
= [actor for actors_list in actors_column for actor in actors_list]
all_actors return pd.Series(all_actors).mode().iloc[0] if all_actors else None
= pd.DataFrame({
director_analysis 'num_shows': top_10_directors,
'most_produced_genre': df_expanded.groupby('director')['genre'].apply(top_genre),
'mean_duration': df_expanded.groupby('director')['duration'].mean().round(2),
'favorite_actor': df_expanded.groupby('director')['actors'].apply(most_common_actor)
}).loc[top_10_directors.index]
#Reset index to make the director's a column
= director_analysis.reset_index()
director_analysis
#Rename columns for clarity
= director_analysis.rename(columns={
director_analysis 'director': 'Director',
'num_shows': 'Number of Shows',
'most_produced_genre': 'Most Produced Genre',
'mean_duration': 'Average Duration',
'favorite_actor': 'Most Featured Actor'
})
Extracting Holiday-themed Movies
= df[df['title'].str.contains('Christmas|Holiday|Santa|Xmas|Easter|Thanksgiving|New Year', case=False, na=False)].reset_index() holiday_movies
Adding a sentiment column would be useful for those who want ‘happy vibes’ movies instead of an action packed one.
R code to communicate with a locally installed Llama 3.2 to extract sentiments using the {mall} package.
holiday_movies_t <- reticulate::py$holiday_movies |>
select(title, genre, release_year, duration, description) |>
llm_sentiment(col = description,
pred_name = 'sentiment')
Ollama local server running
── mall session object
Backend: Ollama
LLM session: model:llama3.2:latest
R session: cache_folder:/tmp/RtmpVAatHB/_mall_cache3ebd55a2c5fc4
R code for creating the following table.
library(reactable)
library(dplyr)
library(htmltools)
get_duration_color <- function(duration) {
color_pal <- function(x) rgb(colorRamp(c("#9fc7df", "#416ea4"))(x), maxColorValue = 255)
normalized <- (duration - min(duration)) / (max(duration) - min(duration))
color_pal(normalized)
}
get_sentiment_color <- function(sentiment) {
color_pal <- function(x) rgb(colorRamp(c("#ff6b6b", "#4ecdc4", "#f5f5dc"))(x), maxColorValue = 255)
# Create a numeric mapping for sentiments
sentiment_mapping <- factor(sentiment, levels = c("negative", "positive", "neutral"))
normalized <- ifelse(is.na(sentiment_mapping), 0.5,
(as.numeric(sentiment_mapping) - 1) / (nlevels(sentiment_mapping) - 1))
color_pal(normalized)
}
holiday_movies_t <- holiday_movies_t %>%
select(title, genre, release_year, duration, sentiment) %>%
mutate(duration_color = get_duration_color(duration),
sentiment_color = get_sentiment_color(sentiment))
holiday_movies_table <- reactable(
holiday_movies_t,
defaultColDef = colDef(
vAlign = "center",
headerClass = "header",
style = list(fontFamily = "Karla, sans-serif")
),
columns = list(
title = colDef(
name = "Movie Title",
minWidth = 250
),
genre = colDef(
name = "Genre",
minWidth = 100
),
release_year = colDef(
name = "Year",
defaultSortOrder = "desc",
width = 80
),
duration = colDef(
name = "Duration",
defaultSortOrder = "desc",
cell = function(value, index) {
color <- holiday_movies_t$duration_color[index]
div(
style = sprintf(
"background-color: %s; border-radius: 4px; padding: 4px; color: black;",
color
),
value
)
},
width = 120
),
sentiment = colDef(
name = "Sentiment",
cell = function(value, index) {
color <- holiday_movies_t$sentiment_color[index]
div(
style = sprintf(
"background-color: %s; border-radius: 4px; padding: 4px; color: black;",
color
),
value
)
},
width = 140
),
duration_color = colDef(show = FALSE),
sentiment_color = colDef(show = FALSE)
),
highlight = TRUE,
language = reactableLang(
noData = "No Holiday movies found",
pageInfo = "{rowStart}\u2013{rowEnd} of {rows} Holiday movies"
),
theme = reactableTheme(
backgroundColor = "transparent",
color = "#333333",
highlightColor = "rgba(0, 0, 0, 0.05)",
borderColor = "rgba(0, 0, 0, 0.1)",
headerStyle = list(
backgroundColor = "transparent",
borderColor = "rgba(0, 0, 0, 0.1)",
borderBottomWidth = "2px",
fontWeight = "600"
)
),
class = "holiday-movies-tbl",
searchable = TRUE,
striped = TRUE,
defaultPageSize = 10,
style = list(fontFamily = "Karla, sans-serif")
)
holiday_movies_table
Data Visualization
The scene on Netflix is expectedly dominated by Americans with a surprising 2nd place for India that is logical after remembering the rise of Bollywood these recent years.
Code
= df['country'].value_counts()
movies_per_country
= px.choropleth(locations=movies_per_country.index,
fig ='country names',
locationmode=movies_per_country.values,
color='Netflix Production Distribution Across Countries',
title= 'Blues')
color_continuous_scale fig.show()
Most movies on Netflix are newish ranging from 2016 to 2018 which were their most active years, but they since got more selective to favor quality shows.
Code
= df.groupby('release_year').size().reset_index(name='count')
shows_per_year
= px.bar(shows_per_year, x='release_year', y='count',
fig ='Distribution of Netflix Shows/Movies release years',
title={'release_year': 'Release Year', 'count': 'Number of Shows'})
labels
fig.show()
Movies on Netflix are mostly Dramas and Comedies with both genres taking up 52.3% of the dataset, with Action coming up third with 15.4% and surprisingly the Children genre being fourth with 421 shows!
The Top 3 are expected from a platform like Netflix which was always known for its drama-packed and intriguing series/movies.
Code
= df['genre'].value_counts()[df['genre'].value_counts() > 50]
genre_counts
= px.pie(values=genre_counts.values,
fig =genre_counts.index,
names='Distribution of Genres')
title
= 'Karla') fig.update_layout(font_family
Starting with a rather balanced catalog, Netflix went on to focus on Drama and Comedy which paved their way into success and more attraction.
Code
= Path('KarlaRegular.ttf')
font_path
= fm.FontProperties(fname=font_path)
karla_font
fm.fontManager.addfont(font_path)
'font.family'] = karla_font.get_name()
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams[
= df.groupby(['release_year', 'genre']).size().unstack(fill_value=0)
genre_yearly
= genre_yearly.sum().nlargest(5).index
top_genres
=(12,6))
plt.figure(figsizefor genre in top_genres:
=genre)
plt.plot(genre_yearly.index, genre_yearly[genre], label
= plt.gcf()
fig 0)
fig.patch.set_alpha(= plt.gca()
ax '#CCC1B7')
ax.set_facecolor('Top 5 Genres Over the Years',
plt.title(=20,
fontsize='bold',
fontweight='#333333')
color'Year')
plt.xlabel('Number of Shows')
plt.ylabel(1940, 2021);
plt.xlim(='x', nbins=16)
ax.locator_params(axis
plt.legend() plt.show()
Interestingly, 5 out of the Top 10 directors primarily produce stand-ups! Which is kind of expected because stand-ups don’t require much effort compared to movies but surprising when we check the previous plot that has stand-ups as the 6th most produced genre with 283 shows.
Code
= Path('KarlaRegular.ttf')
font_path
= fm.FontProperties(fname=font_path)
karla_font
fm.fontManager.addfont(font_path)
'font.family'] = karla_font.get_name()
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams[
= plt.subplots(figsize=(12, 8))
fig, ax '#CCC1B7')
ax.set_facecolor(0)
fig.patch.set_alpha(
# def apply_dark_mode(ax, fig):
# for text in ax.texts:
# text.set_color('white')
#
# ax.tick_params(axis='x', colors='white')
# ax.tick_params(axis='y', colors='white')
#
# for label in ax.get_yticklabels():
# label.set_color('white')
#
# for label in ax.get_xticklabels():
# label.set_color('white')
#
# for figtext in fig.texts:
# figtext.set_color('white')
#
# if dark_mode:
# apply_dark_mode(ax, fig)
= director_analysis['Director'].values
top_directors = df[df['director'].isin(top_directors)]
df_filtered = df_filtered.groupby(['director', 'genre']).size().unstack(fill_value=0)
genre_counts = genre_counts.div(genre_counts.sum(axis=1), axis=0)
genre_proportions = genre_proportions.reindex(top_directors[::-1])
genre_proportions
= {
custom_colors 'Action': '#a10000',
'Children': '#093803',
'Classic Movies': '#4b4b4b',
'Comedies': '#298605',
'Documentaries': '#633103',
'Dramas': '#904fb6',
'International Movies': '#00a2ff',
'Stand-Up': '#1f5383'
}
= df['director'].value_counts().head(10)
director_counts = np.zeros(len(top_directors))
left = top_directors[::-1]
directors = director_counts.values[::-1]
counts
for genre in genre_proportions.columns:
= counts * genre_proportions[genre].values
width = ax.barh(range(len(directors)), width, left=left,
bars =custom_colors[genre], alpha=0.9, label=genre)
color
for i, bar in enumerate(bars):
= genre_proportions[genre].values[i] * 100
percentage if percentage >= 5:
= left[i] + width[i]/2
center
ax.text(center, i, f'{genre}\n{percentage:.1f}%',
='center',
ha='center',
va=8,
fontsize='white')
color
+= width
left
0, 1.1, 'Top 10 Most Productive Directors',
ax.text(=ax.transAxes,
transform=20,
fontsize='bold',
fontweight='#333333')
color
0, 1.05, 'Number of titles directed by each director, with genre distribution',
ax.text(=ax.transAxes,
transform=14,
fontsize='#333333',
color=0.8)
alpha
='both', colors='#333333')
ax.tick_params(axis='y', length=0)
ax.tick_params(axis
0, 18);
plt.xlim(='x', nbins=9)
ax.locator_params(axisTrue, axis='x', linestyle='--', alpha=0.2, color='#666666')
ax.grid(False, axis='y')
ax.grid(
'')
plt.ylabel(range(len(directors)))
ax.set_yticks(
ax.set_yticklabels(directors)
for i, v in enumerate(counts):
+ 0.2, i,
ax.text(v str(int(v)),
='center',
va=10,
fontsize='#333333')
color
=(1.05, 1), loc='upper left', fontsize=8)
plt.legend(bbox_to_anchor
0.85, 0.02, 'Source: DataCamp Datasets',
plt.figtext(=9,
fontsize='#333333',
color=0.7)
alpha
plt.tight_layout()
Further analysis highlights the difference of genres mostly seen in the average duration and the most featured actors where you see renowned names like Harrison Ford compared to the near unknown names! (except for Bill Burr of course.)
R code to display the following table.
library(gt)
library(gtExtras)
add_text_img <- function (text, url, height = 30, left = FALSE) {
text_div <- glue::glue("<div style='display:inline;vertical-align: top;'>{text}</div>")
img_div <- glue::glue("<div style='display:inline;margin-left:2px'>{web_image(url = url, height = height)}</div>")
if (isFALSE(left)) {
paste0(text_div, img_div) %>% gt::html()
}
else {
paste0(img_div, text_div) %>% gt::html()
}
}
reticulate::py$director_analysis %>%
gt() %>%
gt_theme_538() %>%
tab_header(
title = add_text_img(
'Top 10 Directors on',
url = 'https://images.ctfassets.net/y2ske730sjqp/821Wg4N9hJD8vs5FBcCGg/9eaf66123397cc61be14e40174123c40/Vector__3_.svg?w=460',
height = 30
)
) %>%
opt_table_font(font = 'Karla') %>%
opt_align_table_header(align = 'center') %>%
tab_options(table.background.color = 'transparent', column_labels.background.color = "#F0F0F0") %>%
cols_align(columns = c(2:4), align = "center")
Top 10 Directors on
|
||||
---|---|---|---|---|
Director | Number of Shows | Most Produced Genre | Average Duration | Most Featured Actor |
Raúl Campos, Jan Suter | 18 | Stand-Up | 63.61 | Sofía Niño de Rivera |
Marcus Raboy | 15 | Stand-Up | 59.93 | Vir Das |
Jay Karas | 14 | Stand-Up | 71.14 | Bill Burr |
Cathy Garcia-Molina | 13 | Comedies | 118.23 | Joross Gamboa |
Jay Chapman | 12 | Stand-Up | 61.67 | D.L. Hughley |
Martin Scorsese | 12 | Classic Movies | 142.17 | Harvey Keitel |
Youssef Chahine | 12 | Classic Movies | 123.50 | Mahmoud El Meleigy |
Steven Spielberg | 10 | Action | 136.70 | Harrison Ford |
David Dhawan | 9 | Comedies | 138.67 | Anupam Kher |
Shannon Hartman | 8 | Stand-Up | 65.25 | Jo Koy |