Building a Simple Content-Based Recommender System for Movies and TV Shows with Python

Nicole Sim
May 13, 2020 · 7 min read

With over 13,000 titles on Netflix, there is an overwhelming number of entertainment options to choose from! As such, this learning project aims to create a simple content-based recommender system that can recommend TV shows and movies to a user. Of course, this is far more basic than industry systems, but it is a fun personal project to work on!

We take a user’s favourite show or movie as input and return the top 10 titles that are most similar to it. Here, we explore 2 possible ways to identify similar items: (1) a simple similarity measure, Cosine Similarity, and (2) a clustering algorithm, Latent Dirichlet Allocation (LDA).

Data Set used: Kaggle Netflix Movies and TV Shows

Intro to Recommender Systems

Recommender systems can be generally divided into 2 categories: collaborative filtering and content-based. A collaborative filtering system recommends items that other users with similar characteristics have liked in the past. A content-based recommender system recommends items that are similar to the ones the user has liked in the past. Since the data set contains only item data, we will focus on creating a basic content-based recommender system.

Basics on Text Similarity

There are various text similarity metrics, and one of the most popular is Cosine Similarity. Cosine Similarity measures the similarity between 2 documents by taking the cosine of the angle between their vector representations: two documents pointing in the same direction score 1, while orthogonal (completely dissimilar) documents score 0. Here’s a simple example to illustrate the calculation of cosine similarity:
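
As a minimal sketch, consider two made-up documents represented as term-count vectors over a shared vocabulary:

import numpy as np

# Vocabulary: ["blue", "bright", "sky", "sun"]
# doc1 = "the sky is blue"              -> term counts [1, 0, 1, 0]
# doc2 = "the sun in the sky is bright" -> term counts [0, 1, 1, 1]
doc1 = np.array([1, 0, 1, 0])
doc2 = np.array([0, 1, 1, 1])

# cosine similarity = dot(A, B) / (||A|| * ||B||)
print(np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2)))
# -> 0.408..., i.e. 1/sqrt(6): the documents share exactly one term, "sky"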

Basics on Topic Modelling

Topic Modelling is an unsupervised learning technique which groups documents based on content similarity. One popular algorithm is Latent Dirichlet Allocation (LDA). In LDA, each topic is a probability distribution of words and each document is a probability distribution of topics. The more similar the documents are, the closer they are to each other in the multi-dimensional vector space, thus forming clusters.

Let’s get hands-on!

1. Cosine Similarity

1.1 Import all the required packages.

import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

1.2 Read the CSV and understand the data.

df = pd.read_csv("../input/netflix-shows/netflix_titles.csv")
df.head()
  • type - Movie or TV Show
  • title - the title of the movie/TV show
  • director - the name of the director. There can be multiple directors, delimited by commas.
  • cast - the cast involved in the movie. There can be multiple cast members, delimited by commas.
  • rating - the maturity rating of a movie. There are 12 unique values (TV-PG, TV-MA, TV-Y7-FV, TV-Y7, TV-14, R, TV-Y, NR, PG-13, TV-G, PG, G)
  • listed_in - the genre of a movie. For example, Action & Adventure, Documentaries, Comedies.
  • description - story plot summary, about a sentence long per movie.

1.3 Data Pre-Processing. Since factors such as the director, cast, rating, genre and storyline influence a person’s decision, I have combined them into a single string.

cols = ['title', 'type', 'listed_in', 'director', 'cast', 'rating', 'description']
df['combined'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

1.4 As I spotted non-English text such as Chinese characters in some movie titles, I have removed non-ASCII characters using a regular expression.

df['combined'] = df['combined'].map(lambda x: re.sub(r"[^\x00-\x7F]+", "", x))
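
For instance, the substitution keeps the ASCII text and strips everything else (the sample string below is made up):

print(re.sub(r"[^\x00-\x7F]+", "", "Naruto ナルト: Blood Prison"))
# -> 'Naruto : Blood Prison'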

1.5 Create Document Vectors. The CountVectorizer converts all text to lowercase and removes stop words such as ‘the’ and ‘a’.

documents = df['combined']
count_vectorizer = CountVectorizer(stop_words='english')  # converts all words to lowercase and removes stop words
sparse_matrix = count_vectorizer.fit_transform(documents)
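
To see what fit_transform produces, here is a minimal sketch on a two-document toy corpus (the sentences are made up for illustration):

toy_docs = ["the ninja fights the villain", "the ninja trains a young ninja"]
toy_vectorizer = CountVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
print(toy_vectorizer.get_feature_names())
# -> ['fights', 'ninja', 'trains', 'villain', 'young']
print(toy_matrix.toarray())
# -> [[1 1 0 1 0]
#     [0 2 1 0 1]]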

1.6 Compute Cosine Similarity between each pair of documents.

similarity_scores = cosine_similarity(sparse_matrix, sparse_matrix)
scores_df = pd.DataFrame(similarity_scores)  # wrap in a DataFrame so the recommend() function below can look up rows by position

1.7 Top 10 Recommended Movies/TV Shows

def recommend(title, scores_df, df):
    recommended = []

    title = title.lower()
    df['title'] = df['title'].str.lower()
    index = df[df['title'] == title].index[0]
    # skip position 0 (the title itself) and take the next 10 most similar rows
    top10_list = list(scores_df.iloc[index].sort_values(ascending=False).iloc[1:11].index)

    for each in top10_list:
        recommended.append(df.iloc[each].title)

    return recommended

recommend('Naruto Shippuden : Blood Prison', scores_df, df)

>> ['naruto shippuden: the movie', 'naruto shippûden the movie: bonds',
'naruto shippuden: the movie: the lost tower', 'naruto shippûden the movie: the will of fire', 'naruto', 'naruto the movie 2: legend of the stone of gelel',
'naruto the movie 3: guardians of the crescent moon kingdom', 'naruto the movie: ninja clash in the land of snow', 'berserk: the golden age arc iii — the advent', 'id-0']

recommend('Avengers: Infinity War', scores_df, df)

>> ['thor: ragnarok', "cirque du freak: the vampire's assistant", 'limitless',
'inception', 'chris brown: welcome to my life', 'hulk vs.', 'takers', 'her', 'star wars: episode viii: the last jedi', 'scorpion king 5: book of souls']

2. Topic Modelling with LDA

I have chosen to explore the use of LDA on the ‘description’ text to detect similar documents, because I hypothesized that there might be previously unknown underlying topics in the movie/TV show storylines which differ from the typical genre classifications such as Adventure and Romance.

2.1 Import all the required packages.

# Importing modules
import pandas as pd
import numpy as np
import os
import re
# LDA Model
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint
from gensim.models import CoherenceModel
import spacy
from nltk.corpus import stopwords
# Import the wordcloud library
from wordcloud import WordCloud
# Visualize the topics
import pyLDAvis.gensim
import pickle
import pyLDAvis

2.2 Data Pre-Processing

Remove non-ASCII characters

df['description'] = df['description'].map(lambda x: re.sub(r"[^\x00-\x7F]+", "", x))

Tokenisation | Tokenisation splits a document into its individual terms. For example, "I am feeling happy today" → ["I", "am", "feeling", "happy", "today"]

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; punctuation is removed by the tokenizer
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(df['description']))
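
For example, simple_preprocess lowercases the text, strips punctuation, and by default drops tokens shorter than 2 characters, so the single-letter "I" from the illustration above disappears:

print(gensim.utils.simple_preprocess("I am feeling happy today!", deacc=True))
# -> ['am', 'feeling', 'happy', 'today']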

Removal of stop words | Stop words are commonly occurring words such as ‘this’ and ‘am’ which do not add any insight. Here, we remove words that are found in the NLTK library’s existing list of stop words. You can extend the stop word list using the commented-out code if needed.

stop_words = stopwords.words('english')
# stop_words.extend(['']) # extend the existing stop word list if needed

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

Building of bigrams | A bigram is a sequence of 2 word tokens that co-occur often enough to satisfy the collocation counts defined by the min_count and threshold parameters. In our dataset, we found bigrams such as "martial_arts" and "high_school". These words are more meaningful when taken as bigrams; if taken as unigrams such as "martial", "arts", "high", "school", the meaning is lost.

# Build the bigram model
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=10)
bigram_mod = gensim.models.phrases.Phraser(bigram)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
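
As a rough sketch, once the Phraser has learned a phrase like "martial arts" from the corpus, applying bigram_mod to a token list merges the pair into a single token. Whether a given pair is merged depends on the counts in the actual data, so the output below is hypothetical:

print(bigram_mod[['ninja', 'learns', 'martial', 'arts']])
# hypothetical output, assuming 'martial arts' met the frequency thresholds:
# ['ninja', 'learns', 'martial_arts']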

Lemmatization | We convert a word to its valid root form. For example, "running" → "run". Only nouns, adjectives, verbs and adverbs are kept.

# Load the spaCy English model used by nlp() below (assumed: en_core_web_sm)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

2.3 Create corpus

# Apply the pre-processing pipeline defined above
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
data_lemmatized = lemmatization(data_words_bigrams)
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Filter out tokens that appear in fewer than 2 documents or in more than 90% of the documents
id2word.filter_extremes(no_below=2, no_above=0.9)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
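
doc2bow converts each document into a sparse list of (token_id, count) pairs. The ids below are hypothetical, since they depend on the dictionary built from the actual data:

print(id2word.doc2bow(['ninja', 'fight', 'ninja']))
# hypothetical output: [(3, 1), (7, 2)]
# i.e. the token with id 3 ('fight') appears once and id 7 ('ninja') appears twice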

2.4 Build LDA Model

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=15,
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       alpha=0.01,
                                       eta=0.9)

I have set the number of topics to 15 based on the Coherence Score; refer to the Github repo for the full code. Please note that the LDA output differs slightly on each run, as a new gamma matrix is used during each inference.
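
For reference, here is a minimal sketch of how the coherence score for a fitted model can be computed with the CoherenceModel imported earlier (the full loop over candidate topic counts is in the repo):

coherence_model = CoherenceModel(model=lda_model, texts=data_lemmatized,
                                 dictionary=id2word, coherence='c_v')
print('Coherence Score:', coherence_model.get_coherence())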

2.5 Get the 15 topics and their keywords

# Print the keywords for the 15 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
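
The pyLDAvis imports above can then be used to inspect the topics interactively in a notebook; a minimal sketch (pyLDAvis.gensim.prepare matches the pyLDAvis API at the time of writing; newer releases renamed the module to pyLDAvis.gensim_models):

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
pyLDAvis.display(vis)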

2.6 Create Document Topic Matrix

doc_num, topic_num, prob = [], [], []
for n in range(len(df)):
    get_document_topics = lda_model.get_document_topics(corpus[n])
    doc_num.append(n)
    # sort the (topic, probability) pairs by probability, highest first
    sorted_doc_topics = sorted(get_document_topics, key=lambda pair: pair[1], reverse=True)
    topic_num.append(sorted_doc_topics[0][0])  # the dominant topic
    prob.append(sorted_doc_topics[0][1])       # and its probability
df['Doc'] = doc_num
df['Topic'] = topic_num
df['Probability'] = prob
df.to_csv("doc_topic_matrix.csv", index=False)

2.7 Top 10 Recommended Movies/TV Shows

def recommend_by_storyline(title, df):
    recommended = []
    top10_list = []

    title = title.lower()
    df['title'] = df['title'].str.lower()
    topic_num = df[df['title'] == title].Topic.values
    doc_num = df[df['title'] == title].Doc.values

    # all titles in the same topic, ranked by their topic probability
    output_df = df[df['Topic'] == topic_num[0]].sort_values('Probability', ascending=False).reset_index(drop=True)
    index = output_df[output_df['Doc'] == doc_num[0]].index[0]

    # take the 5 titles ranked just above and the 5 just below the input title
    top10_list += list(output_df.iloc[max(index - 5, 0):index].index)  # max() guards against a negative slice start
    top10_list += list(output_df.iloc[index + 1:index + 6].index)

    output_df['title'] = output_df['title'].str.title()

    for each in top10_list:
        recommended.append(output_df.iloc[each].title)

    return recommended

recommend_by_storyline('Naruto Shippuden : Blood Prison', df)

>> ['La Viuda Negra', 'Power Rangers Super Samurai: Trickster Treat', 'Mighty Morphin Alien Rangers', 'The Brave', 'Saint Seiya: The Lost Canvas', 'K-19: The Widowmaker', 'Supernature: Wild Flyers', 'Pukar', 'Barbie: A Fairy Secret', 'Sarajevo']

recommend_by_storyline('Avengers: Infinity War', df)

>> ['The Seven Deadly Sins', 'Ninja Turtles: The Next Mutation', 'Super Monsters', 'Cyborg 009 Vs Devilman', 'Get Smart', "Oh No! It'S An Alien Invasion", 'Svaha: The Sixth Finger', 'Fullmetal Alchemist: Brotherhood', "Jake'S Buccaneer Blast", 'Maharakshak: Aryan']

Huge thanks to these amazing resources!

https://www.cse.iitk.ac.in/users/nsrivast/HCC/Recommender_systems_handbook.pdf

https://towardsdatascience.com/how-to-build-from-scratch-a-content-based-movie-recommender-with-natural-language-processing-25ad400eb243

https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

Thank you for reading! :) Please check out the Github repo for this project’s full code. If you found this article helpful, I would really appreciate if you could follow my account and give this article a clap, thank you!!
