Online streaming platforms like Netflix have large movie catalogs. If we can build a recommendation system that recommends relevant movies to users based on their historical interactions, this would improve customer satisfaction and, in turn, the platform's revenue. The techniques covered here are not limited to movies; they apply to any item for which you want to build a recommendation system.
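In this project we will build several recommendation systems: a rank-based system, similarity-based collaborative filtering (user-user and item-item), and model-based collaborative filtering using matrix factorization.
We are going to use the ratings dataset.
The ratings dataset contains the following attributes: userId, movieId, rating, and timestamp.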
Sometimes, the installation of the surprise library, which is used to build recommendation systems, faces issues in Jupyter. To avoid any issues, it is advised to use Google Colab for this case study.
Let's start by mounting the Google drive on Colab.
# Run this cell only if you are using Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Installing surprise library
# Installing surprise library, only do it for first time
!pip install surprise
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')
# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd
# Basic library for data visualization
import matplotlib.pyplot as plt
# Slightly advanced library for data visualization
import seaborn as sns
# A dictionary output that does not raise a key error
from collections import defaultdict
# Performance metrics in surprise
from surprise import accuracy
# Class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader
# Class for loading datasets
from surprise.dataset import Dataset
# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV
# For splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split
# For implementing similarity based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic
# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD
# For implementing cross validation
from surprise.model_selection import KFold
# Import the dataset
# rating = pd.read_csv('ratings.csv')  # Use this line instead if you are running locally
rating = pd.read_csv('/content/drive/MyDrive/Colab/ratings.csv')  # Path on Google Drive when using Google Colab
Let's check the info of the data
rating.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userId     100004 non-null  int64
 1   movieId    100004 non-null  int64
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
# Dropping timestamp column
rating = rating.drop(['timestamp'], axis=1)
# Printing the top 5 rows of the dataset
rating.head()
 | userId | movieId | rating |
---|---|---|---|
0 | 1 | 31 | 2.5 |
1 | 1 | 1029 | 3.0 |
2 | 1 | 1061 | 3.0 |
3 | 1 | 1129 | 2.0 |
4 | 1 | 1172 | 4.0 |
plt.figure(figsize = (12, 4))
sns.countplot(x='rating',data=rating)
plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.show()
From the countplot, rating 4 has the highest count (more than 25,000 ratings), followed by rating 3 with a count of around 20,000.
# Finding number of unique users
rating['userId'].nunique()
671
There are 671 unique users in the dataset.
# Finding number of unique movies
rating['movieId'].nunique()
9066
There are 9066 unique movies in the dataset. With 671 users, there could be up to 671 * 9066 = 6083286 ratings, but we have only 100004, i.e., not every user has rated every movie. We can build a recommendation system to recommend movies to users that they have not interacted with.
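The sparsity described above can be quantified directly. The following is a minimal sketch on a small made-up DataFrame (`toy_rating` and its values are hypothetical; the same expressions should apply to `rating`):

```python
import pandas as pd

# Hypothetical toy ratings, shaped like the `rating` DataFrame used in this notebook
toy_rating = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3, 3],
    'movieId': [10, 20, 10, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

n_users = toy_rating['userId'].nunique()
n_movies = toy_rating['movieId'].nunique()

# Every user could rate every movie, so this is the size of the full user-item matrix
possible_interactions = n_users * n_movies

# Fraction of the user-item matrix that is actually filled in
density = len(toy_rating) / possible_interactions
sparsity = 1 - density

print(f"possible={possible_interactions}, observed={len(toy_rating)}, sparsity={sparsity:.2%}")
```

With the figures reported above (100004 observed ratings out of 6083286 possible), the density comes out to roughly 1.6%, so about 98% of the user-item matrix is empty.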
rating.groupby(['userId', 'movieId']).count()
userId | movieId | rating |
---|---|---|
1 | 31 | 1 |
1 | 1029 | 1 |
1 | 1061 | 1 |
1 | 1129 | 1 |
1 | 1172 | 1 |
... | ... | ... |
671 | 6268 | 1 |
671 | 6269 | 1 |
671 | 6365 | 1 |
671 | 6385 | 1 |
671 | 6565 | 1 |
100004 rows × 1 columns
rating.groupby(['userId', 'movieId']).count()['rating'].sum()
100004
The sum equals the total number of observations, which means each user-movie pair appears exactly once: no user has rated the same movie more than once.
rating['movieId'].value_counts()
356 341
296 324
318 311
593 304
260 291
...
98604 1
103659 1
104419 1
115927 1
6425 1
Name: movieId, Length: 9066, dtype: int64
The movie with movieId=356 has the most interactions: 341 users have rated it. The remaining 330 users (671 - 341) have not interacted with it yet, and a recommendation model can predict how they are likely to rate it.
# Plotting distributions of ratings for 341 interactions with movieid 356
plt.figure(figsize=(7,7))
rating[rating['movieId'] == 356]['rating'].value_counts().plot(kind='bar')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()
From the plot, the most common rating for movieId=356 is 4, followed by 5, so most users liked the movie. Ratings of 2 and below are rare, which means very few users disliked it.
rating['userId'].value_counts()
547 2391
564 1868
624 1735
15 1700
73 1610
...
296 20
289 20
249 20
221 20
1 20
Name: userId, Length: 671, dtype: int64
The user with userId=547 has the most interactions, having rated 2391 movies. There are still 6675 movies (9066 - 2391) that this user has not interacted with.
# Finding user-movie interactions distribution
count_interactions = rating.groupby('userId').count()['movieId']
count_interactions
userId
1 20
2 76
3 51
4 204
5 100
...
667 68
668 20
669 37
670 31
671 115
Name: movieId, Length: 671, dtype: int64
# Plotting user-movie interactions distribution
plt.figure(figsize=(15,7))
sns.histplot(count_interactions)
plt.xlabel('Number of Interactions by Users')
plt.show()
The distribution is highly right-skewed, which shows that only a few users have interacted with more than 100 movies.
Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful when we have cold start problems. Cold start refers to the issue when we get a new user into the system and the machine is not able to recommend movies to the new user, as the user did not have any historical interactions in the dataset. In those cases, we can use rank-based recommendation system to recommend movies to the new user.
To build the rank-based recommendation system, we take average of all the ratings provided to each movie and then rank them based on their average rating.
# Calculating average ratings
average_rating = rating.groupby('movieId').mean()['rating']
# Calculating the count of ratings
count_rating = rating.groupby('movieId').count()['rating']
# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})
final_rating.head()
movieId | avg_rating | rating_count |
---|---|---|
1 | 3.872470 | 247 |
2 | 3.401869 | 107 |
3 | 3.161017 | 59 |
4 | 2.384615 | 13 |
5 | 3.267857 | 56 |
Now, let's create a function to find the top n movies for a recommendation based on the average ratings of movies. We can also add a threshold for a minimum number of interactions for a movie to be considered for recommendation.
def top_n_movies(data, n, min_interaction=100):
    # Finding movies with at least the minimum number of interactions
    recommendations = data[data['rating_count'] >= min_interaction]
    # Sorting values w.r.t. average rating
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    return recommendations.index[:n]
We can use this function with different n's and minimum interactions to get movies to recommend
list(top_n_movies(final_rating,5,min_interaction=50))
[858, 318, 969, 913, 1221]
list(top_n_movies(final_rating,5,min_interaction=100))
[858, 318, 1221, 50, 527]
list(top_n_movies(final_rating,5,min_interaction=200))
[858, 318, 50, 527, 608]
Now that we have seen how to apply the Rank-Based Recommendation System, let's apply the Collaborative Filtering Based Recommendation Systems.
In the above interactions matrix, out of users B and C, which user is most likely to interact with the movie, "The Terminal"?
In this type of recommendation system, we do not need any information about the users or items. We only need user-item interaction data to build a collaborative recommendation system. For example -
Types of Collaborative Filtering
Similarity/Neighborhood based
Model based
Similarity/neighborhood-based collaborative filtering here uses cosine similarity and KNN to find the users that are the nearest neighbors of a given user. We will use the surprise library to build the remaining models. The necessary classes and functions from this library were imported above. Below we load the rating dataset, which is a pandas DataFrame, into a different format called surprise.dataset.DatasetAutoFolds, which is required by this library. To do this, we use the classes Reader and Dataset. Finally, we split the data into a train set and a test set.
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))
# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
sim_options = {'name': 'cosine',
'user_based': True}
# Defining Nearest neighbour algorithm
algo_knn_user = KNNBasic(sim_options=sim_options,verbose=False)
# Train the algorithm on the trainset or fitting the model on train dataset
algo_knn_user.fit(trainset)
# Predict ratings for the testset
predictions = algo_knn_user.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
RMSE: 0.9925
0.9924509041520163
The baseline user-user model has RMSE = 0.9925 on the test set. We can try to improve this score by tuning the hyperparameters with GridSearchCV.
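To make the similarity measure behind the baseline model more concrete, here is a minimal sketch of cosine similarity between two users, computed only over the movies both have rated. This is an illustration on made-up data, not the library's own code; the function name `cosine_sim_common` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_sim_common(u, v):
    """Cosine similarity between two rating vectors, using only co-rated items.

    Missing ratings are encoded as np.nan. Returns 0.0 when the users share no items.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    common = ~np.isnan(u) & ~np.isnan(v)  # items rated by both users
    if not common.any():
        return 0.0
    u_c, v_c = u[common], v[common]
    return float(u_c @ v_c / (np.linalg.norm(u_c) * np.linalg.norm(v_c)))

# Hypothetical ratings of three users over four movies (np.nan = not rated)
user_a = [4.0, np.nan, 5.0, 1.0]
user_b = [5.0, 3.0, 4.0, np.nan]
user_c = [np.nan, 2.0, np.nan, np.nan]

sim_ab = cosine_sim_common(user_a, user_b)  # two movies in common
sim_ac = cosine_sim_common(user_a, user_c)  # no movies in common
print(round(sim_ab, 4), sim_ac)
```

Because ratings are all positive, cosine similarity between users tends to be close to 1 even for users with fairly different tastes, which is one reason other measures such as msd are worth trying during tuning.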
Let's now predict the rating for the user with userId=4 and movieId=10.
algo_knn_user.predict(4,10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 3.62 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6244912065910952, details={'actual_k': 40, 'was_impossible': False})
The estimated rating for this user-item pair is 3.62, whereas the actual rating is 4.
Let's predict the rating for the same userId=4 but for a movie which this user has not interacted with before, i.e., movieId=3.
algo_knn_user.predict(4,3,verbose=True)
user: 4 item: 3 r_ui = None est = 3.20 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.202703552548654, details={'actual_k': 40, 'was_impossible': False})
The predicted rating for this user-item pair is 3.20. Since the user has not rated this movie, there is no actual rating to compare against. We can try to improve the model by tuning its hyperparameters.
Below we will tune the hyperparameters of the KNNBasic algorithm. Let's try to understand the different hyperparameters of the KNNBasic algorithm -
For more details please refer the official documentation https://surprise.readthedocs.io/en/stable/knn_inspired.html
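The roles of `k` and `min_k` can be illustrated with a small sketch of a similarity-weighted average. This is a simplified illustration under stated assumptions, not the library's implementation: the helper `knn_basic_estimate` and its inputs are hypothetical, and it assumes that only neighbors with positive similarity contribute and that the estimate falls back to the global mean when fewer than `min_k` neighbors are usable.

```python
import numpy as np

def knn_basic_estimate(neighbor_sims, neighbor_ratings, k, min_k, global_mean):
    """Similarity-weighted average over the k most similar neighbors who rated the item.

    Falls back to global_mean when fewer than min_k usable neighbors exist.
    """
    sims = np.asarray(neighbor_sims, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)

    # Keep the k neighbors with the highest similarity
    order = np.argsort(sims)[::-1][:k]
    sims, ratings = sims[order], ratings[order]

    # Only neighbors with positive similarity contribute to the average
    mask = sims > 0
    if mask.sum() < min_k:
        return global_mean
    return float((sims[mask] * ratings[mask]).sum() / sims[mask].sum())

# Hypothetical similarities and ratings of four neighbors for one target movie
sims = [0.9, 0.5, 0.1, 0.8]
ratings = [4.0, 3.0, 1.0, 5.0]

est = knn_basic_estimate(sims, ratings, k=2, min_k=1, global_mean=3.5)
fallback = knn_basic_estimate(sims, ratings, k=2, min_k=3, global_mean=3.5)
print(round(est, 4), fallback)
```

A larger `k` averages over more neighbors and smooths the estimate, while a larger `min_k` makes the model refuse to produce a neighborhood-based estimate when too few neighbors are available.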
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
'sim_options': {'name': ['msd', 'cosine'],
'user_based': [True]}
}
# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting the data
grid_obj.fit(data)
# Best RMSE score
print(grid_obj.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
0.9653356985061953
{'k': 20, 'min_k': 3, 'sim_options': {'name': 'msd', 'user_based': True}}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.
Below we are analysing evaluation metrics - RMSE and MAE at each and every split to analyze the impact of each value of hyperparameters
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
Now, let's build the final model by using tuned values of the hyperparameters, which we received by using grid search cross-validation.
sim_options = {'name': 'msd',
'user_based': True}
# Using the optimal similarity measure for user-user based collaborative filtering
# Creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_user = KNNBasic(sim_options=sim_options, k=20, min_k=3,verbose=False)
# Training the algorithm on the trainset
similarity_algo_optimized_user.fit(trainset)
# Predicting ratings for the testset
predictions = similarity_algo_optimized_user.test(testset)
# Computing RMSE on testset
accuracy.rmse(predictions)
RMSE: 0.9571
0.9571445417153293
After tuning the hyperparameters, the test RMSE dropped from 0.9925 to 0.9571, a modest improvement in model performance.
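For reference, the RMSE values compared above are root mean squared errors between actual and predicted ratings. A minimal sketch of that computation, using made-up ratings (the `rmse` helper and the numbers are hypothetical):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences of ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical actual and predicted ratings for four user-movie pairs
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

score = rmse(actual, predicted)
print(score)
```

Because errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and lower values indicate better predictions.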
Let's now predict the rating for the user with userId=4 and movieId=10 with the optimized model.
# Note: this call passes movieId=0 rather than 10; movieId=0 is not in the training set, so the prediction below is flagged as impossible
similarity_algo_optimized_user.predict(4,0, r_ui=4, verbose=True)
user: 4 item: 0 r_ui = 4.00 est = 3.55 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=4, iid=0, r_ui=4, est=3.5459045285801785, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
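The baseline model predicted 3.62 for movieId=10, while the output above shows 3.55. The two values are not directly comparable, however: the call above passes movieId=0, which is unknown to the model, so the prediction is flagged with 'was_impossible': True and 3.55 is only the default estimate (the global mean of the training ratings). The call needs to be rerun with movieId=10 for a like-for-like comparison.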
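Since the output above reports 'User and/or item is unknown.', it can help to check that an id exists in the data before interpreting a prediction. The following is a minimal sketch with a made-up DataFrame; `toy_rating` and `is_known` are hypothetical names, and in the notebook the same check could be run against `rating` (keeping in mind that the model only knows ids that ended up in the train set).

```python
import pandas as pd

# Hypothetical toy ratings, shaped like the `rating` DataFrame
toy_rating = pd.DataFrame({
    'userId':  [4, 4, 7],
    'movieId': [10, 34, 10],
    'rating':  [4.0, 5.0, 3.0],
})

def is_known(df, user_id, movie_id):
    """Return True only if both the user and the movie appear in the ratings data."""
    return bool((df['userId'] == user_id).any() and (df['movieId'] == movie_id).any())

known_pair = is_known(toy_rating, 4, 10)   # both ids occur in the data
unknown_pair = is_known(toy_rating, 4, 0)  # movieId=0 never occurs
print(known_pair, unknown_pair)
```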
Below we predict the rating for the same userId=4 but for a movie which this user has not interacted with before, i.e., movieId=3, by using the optimized model -
similarity_algo_optimized_user.predict(4,3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.72 {'actual_k': 20, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.7228745701935386, details={'actual_k': 20, 'was_impossible': False})
Comparing the baseline and optimized models for this pair, the predicted ratings are 3.20 (baseline) and 3.72 (optimized).
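We can also find the users most similar to a given user, i.e., its nearest neighbors, based on this KNNBasic algorithm. Below we find the 5 most similar users to userId=4 based on the msd distance metric. Note that get_neighbors expects and returns surprise inner ids rather than raw ids, so the 4 passed below is interpreted as an inner id; trainset.to_inner_uid and trainset.to_raw_uid convert between the two.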
similarity_algo_optimized_user.get_neighbors(4, k=5)
[665, 417, 647, 654, 260]
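The msd measure used by the optimized model can be sketched as follows. This is a simplified illustration on made-up vectors, assuming the definition in the Surprise documentation (mean squared difference over co-rated items, turned into a similarity as 1 / (msd + 1)); the function name `msd_similarity` is hypothetical.

```python
import numpy as np

def msd_similarity(u, v):
    """Mean-squared-difference similarity over co-rated items: 1 / (msd + 1).

    Missing ratings are encoded as np.nan. Returns 0.0 when there are no common items.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    common = ~np.isnan(u) & ~np.isnan(v)  # items rated by both users
    if not common.any():
        return 0.0
    msd = np.mean((u[common] - v[common]) ** 2)
    return float(1.0 / (msd + 1.0))

# Hypothetical ratings of two users over four movies (np.nan = not rated)
user_a = [4.0, np.nan, 5.0, 1.0]
user_b = [5.0, 3.0, 4.0, np.nan]

sim_ab = msd_similarity(user_a, user_b)  # differences of -1 and 1 on the two common movies
sim_aa = msd_similarity(user_a, user_a)  # identical ratings
print(sim_ab, sim_aa)
```

Unlike cosine similarity, msd penalizes differences in the actual rating values, so two users who rate the same movies with consistently different scores are treated as less similar.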
Below we implement a function whose input parameters are data (the ratings DataFrame), user_id, top_n (the number of movies to return), and algo (a fitted model):
def get_recommendations(data, user_id, top_n, algo):
    # Creating an empty list to store the recommended movie ids
    recommendations = []
    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # Extracting those movie ids which the user_id has not interacted with yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # Looping through each of the movie ids which user_id has not interacted with yet
    for item_id in non_interacted_movies:
        # Predicting the rating for this non-interacted movie id by this user
        est = algo.predict(user_id, item_id).est
        # Appending the predicted rating
        recommendations.append((item_id, est))
    # Sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:top_n]  # Returning the top n movies with the highest predicted ratings for this user
recommendations = get_recommendations(rating,4,5,similarity_algo_optimized_user)
recommendations
[(309, 5), (3038, 5), (6273, 4.928202652354184), (98491, 4.863224466679252), (2721, 4.845513973527148)]
# Defining the similarity measure
sim_options = {'name': 'cosine',
'user_based': False}
# Defining Nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options=sim_options,verbose=False)
# Train the algorithm on the trainset or fitting the model on train dataset
algo_knn_item.fit(trainset)
# Predict ratings for the testset
predictions = algo_knn_item.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
RMSE: 1.0032
1.003221450633729
The RMSE of the baseline item-based model is 1.0032. We can tune the hyperparameters to try to reduce it.
Let's now predict the rating for the user with userId=4 and movieId=10.
# Note: this call passes movieId=0 rather than 10; movieId=0 is not in the training set, so the prediction below is flagged as impossible
algo_knn_item.predict(4,0, r_ui=4, verbose=True)
user: 4 item: 0 r_ui = 4.00 est = 3.55 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=4, iid=0, r_ui=4, est=3.5459045285801785, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
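The actual rating passed for this user-item pair is 4 and the estimate is 3.55. However, the call uses movieId=0, which is not in the training set, so the prediction is flagged as impossible and 3.55 is the default estimate (the global mean of the training ratings), not a similarity-based prediction.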
Let's predict the rating for the same userId=4 but for a movie which this user has not interacted with before, i.e., movieId=3.
algo_knn_item.predict(4,3,verbose=True)
user: 4 item: 3 r_ui = None est = 4.07 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=4.071601862880049, details={'actual_k': 40, 'was_impossible': False})
The predicted rating for this user-item pair is 4.07. The user has not rated this movie, so there is no actual rating to compare against. Next, we tune the hyperparameters.
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30,40], 'min_k': [3,6,9],
'sim_options': {'name': ['msd', 'cosine'],
'user_based': [False]}
}
# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting the data
grid_obj.fit(data)
# Best RMSE score
print(grid_obj.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
0.9399929753224469
{'k': 40, 'min_k': 3, 'sim_options': {'name': 'msd', 'user_based': False}}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above
Below we are analysing evaluation metrics - RMSE and MAE at each and every split to analyze the impact of each value of hyperparameters
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
 | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_k | param_min_k | param_sim_options |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.958158 | 0.950637 | 0.943976 | 0.950924 | 0.005794 | 7 | 0.739567 | 0.734922 | 0.727845 | 0.734111 | 0.004820 | 7 | 5.021798 | 1.125042 | 16.716629 | 1.318054 | {'k': 20, 'min_k': 3, 'sim_options': {'name': ... | 20 | 3 | {'name': 'msd', 'user_based': False} |
1 | 1.021201 | 1.011993 | 1.005091 | 1.012762 | 0.006599 | 16 | 0.796983 | 0.790784 | 0.782162 | 0.789976 | 0.006077 | 16 | 6.838112 | 1.777080 | 13.569611 | 2.775545 | {'k': 20, 'min_k': 3, 'sim_options': {'name': ... | 20 | 3 | {'name': 'cosine', 'user_based': False} |
2 | 0.958155 | 0.950605 | 0.944098 | 0.950953 | 0.005744 | 8 | 0.739654 | 0.734957 | 0.728037 | 0.734216 | 0.004771 | 8 | 2.813337 | 0.281329 | 11.369127 | 0.072564 | {'k': 20, 'min_k': 6, 'sim_options': {'name': ... | 20 | 6 | {'name': 'msd', 'user_based': False} |
3 | 1.021404 | 1.012055 | 1.005280 | 1.012913 | 0.006610 | 17 | 0.797123 | 0.790878 | 0.782375 | 0.790125 | 0.006044 | 17 | 3.878078 | 0.134031 | 13.151879 | 1.024427 | {'k': 20, 'min_k': 6, 'sim_options': {'name': ... | 20 | 6 | {'name': 'cosine', 'user_based': False} |
4 | 0.959678 | 0.950621 | 0.944010 | 0.951436 | 0.006422 | 9 | 0.740640 | 0.735047 | 0.728180 | 0.734622 | 0.005095 | 9 | 2.792980 | 0.271339 | 15.215239 | 2.727743 | {'k': 20, 'min_k': 9, 'sim_options': {'name': ... | 20 | 9 | {'name': 'msd', 'user_based': False} |
Now let's build the final model by using tuned values of the hyperparameters which we received by using grid search cross-validation.
# Creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options= {'name': 'msd','user_based': False}, k=40, min_k=3,verbose=False)
# Training the algorithm on the trainset
similarity_algo_optimized_item.fit(trainset)
# Predicting ratings for the testset
predictions = similarity_algo_optimized_item.test(testset)
# Computing RMSE on testset
accuracy.rmse(predictions)
RMSE: 0.9433
0.9433184999641279
After tuning the hyperparameters, the RMSE decreased from 1.0032 (baseline) to 0.9433. Since a lower RMSE is better, the grid search improved the item-based model.
Let's now predict the rating for the user with userId=4 and movieId=10 with the optimized model, as shown below.
# Note: this call passes movieId=0 rather than 10; movieId=0 is not in the training set, so the prediction below is flagged as impossible
similarity_algo_optimized_item.predict(4,0, r_ui=4, verbose=True)
user: 4 item: 0 r_ui = 4.00 est = 3.55 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=4, iid=0, r_ui=4, est=3.5459045285801785, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
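Compared with the baseline model, the predicted rating is unchanged at 3.55. This is expected: the call uses movieId=0, which is unknown to the model, so both models return the same default estimate (the global mean of the training ratings) rather than a neighborhood-based prediction.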
Let's predict the rating for the same userId=4 but for a movie which this user has not interacted with before, i.e., movieId=3, by using the optimized model:
similarity_algo_optimized_item.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.87 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.865175609312417, details={'actual_k': 40, 'was_impossible': False})
The predicted rating from the optimized model is 3.87, compared with 4.07 from the baseline model. The user has not rated this movie, so there is no actual rating to compare against.
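We can also find the items most similar to a given item, i.e., its nearest neighbors, based on this KNNBasic algorithm. Below we find the 5 most similar items to movieId=3 based on the msd distance metric. Note that get_neighbors expects and returns surprise inner ids rather than raw ids, so the 3 passed below is interpreted as an inner id; trainset.to_inner_iid and trainset.to_raw_iid convert between the two.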
similarity_algo_optimized_item.get_neighbors(3, k=5)
recommendations = get_recommendations(rating, 4, 5, similarity_algo_optimized_item)
recommendations
Model-based collaborative filtering is a personalized recommendation system: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use latent features to find recommendations for each user.
Latent Features: The features that are not present in the empirical data but can be inferred from the data. For example:
Now if we notice the above movies closely:
Here Action, Romance, Suspense and Comedy are latent features of the corresponding movies. Similarly, we can compute the latent features for users as shown below:
SVD is used to compute the latent features from the user-item matrix. However, classical SVD cannot be applied when the user-item matrix has missing values.
First we need to convert the below movie-rating dataset:
into an user-item matrix as shown below:
We have already done this above while computing cosine similarities.
SVD decomposes this above matrix into three separate matrices:
The above matrix is an n x k matrix, where:
The above matrix is a k x k matrix, where:
The above matrix is a k x n matrix, where:
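The three-matrix decomposition described above can be tried out on a small dense matrix. The sketch below uses NumPy's classical SVD on a made-up user-item matrix with no missing values and keeps only k latent features; the matrix `A` and the choice k = 2 are assumptions made for illustration, and this is not how the surprise `SVD` class is fitted.

```python
import numpy as np

# Hypothetical dense user-item matrix (3 users x 4 movies), no missing values
A = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
])

# Classical SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                   # number of latent features to keep
U_k = U[:, :k]          # users x k: each user's strength on each latent feature
S_k = np.diag(s[:k])    # k x k: importance of each latent feature
Vt_k = Vt[:k, :]        # k x movies: each movie's strength on each latent feature

# Rank-k approximation of the original matrix
A_k = U_k @ S_k @ Vt_k
print(np.round(A_k, 2))
```

Keeping only the largest singular values gives the closest rank-k approximation of the matrix, which is why a small number of latent features can capture most of the structure in the ratings.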
# Using SVD matrix factorization
algo_svd = SVD()
# Training the algorithm on the trainset
algo_svd.fit(trainset)
# Predicting ratings for the testset
predictions = algo_svd.test(testset)
# Computing RMSE on the testset
accuracy.rmse(predictions)
RMSE: 0.9034
0.9034198535037269
The RMSE of the baseline SVD model on the test set is 0.9034, which is lower than the RMSE of the baseline item-item similarity-based model (1.0032) and also lower than that of the optimized item-item similarity-based model (0.9433).
Let's now predict the rating for the user with userId=4 and movieId=10.
algo_svd.predict(4, 10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 4.15 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=4.1542603434778105, details={'was_impossible': False})
Using the matrix factorization model, the predicted rating for this user-item pair is 4.15, whereas the actual rating is 4, so the rating is slightly overestimated. We can try to improve this by tuning the hyperparameters.
Let's predict the rating for the same userId=4 but for a movie which this user has not interacted with before, i.e., movieId=3:
algo_svd.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.53 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.5325352777024848, details={'was_impossible': False})
The estimated rating is 3.53. The user has not rated this movie, so there is no actual rating to compare against.
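In SVD, the rating is predicted as (following the Surprise documentation):
$$\hat{r}_{ui} = \mu + b_{u} + b_{i} + q_{i}^{T} p_{u}$$
If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.
To estimate all the unknowns, we minimize the following regularized squared error:
$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^{2} + \lambda \left(b_{i}^{2} + b_{u}^{2} + \lVert q_{i} \rVert^{2} + \lVert p_{u} \rVert^{2}\right)$$
The minimization is performed by a very straightforward stochastic gradient descent, where $e_{ui} = r_{ui} - \hat{r}_{ui}$, $\gamma$ is the learning rate, and $\lambda$ is the regularization term:
$$b_{u} \leftarrow b_{u} + \gamma \left(e_{ui} - \lambda b_{u}\right)$$
$$b_{i} \leftarrow b_{i} + \gamma \left(e_{ui} - \lambda b_{i}\right)$$
$$p_{u} \leftarrow p_{u} + \gamma \left(e_{ui} \cdot q_{i} - \lambda p_{u}\right)$$
$$q_{i} \leftarrow q_{i} + \gamma \left(e_{ui} \cdot p_{u} - \lambda q_{i}\right)$$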
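The stochastic gradient descent procedure mentioned above can be sketched in a few lines of NumPy. This is a simplified illustration, not the library's implementation: it assumes the biased matrix-factorization estimate (global mean plus user bias, item bias, and the dot product of item and user factors), and the toy ratings, the number of factors, and the helper names `predict`, `sgd_epoch`, and `train_rmse` are all hypothetical.

```python
import numpy as np

def predict(mu, b_u, b_i, P, Q, u, i):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u"""
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def sgd_epoch(ratings, mu, b_u, b_i, P, Q, lr, reg):
    """One SGD pass over (user_index, item_index, rating) triples; updates arrays in place."""
    for u, i, r in ratings:
        err = r - predict(mu, b_u, b_i, P, Q, u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        p_old = P[u].copy()  # keep the old user factors for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])

def train_rmse(ratings, mu, b_u, b_i, P, Q):
    """RMSE of the current parameters on the given ratings."""
    errors = [r - predict(mu, b_u, b_i, P, Q, u, i) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical toy data: (user_index, item_index, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.5)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
mu = float(np.mean([r for _, _, r in ratings]))     # global mean rating
b_u = np.zeros(n_users)                             # user biases
b_i = np.zeros(n_items)                             # item biases
P = rng.normal(0, 0.1, size=(n_users, n_factors))   # user factors
Q = rng.normal(0, 0.1, size=(n_items, n_factors))   # item factors

rmse_before = train_rmse(ratings, mu, b_u, b_i, P, Q)
for _ in range(100):
    sgd_epoch(ratings, mu, b_u, b_i, P, Q, lr=0.01, reg=0.02)
rmse_after = train_rmse(ratings, mu, b_u, b_i, P, Q)
print(rmse_before, rmse_after)
```

In this sketch, `lr` plays the role of the learning rate and `reg` the regularization strength, and the number of loop iterations corresponds to the number of epochs, which are the three quantities tuned in the grid search.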
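There are many hyperparameters to tune in this algorithm; the full list is available in the surprise documentation for the SVD class.
Below we will tune only three hyperparameters: n_epochs (the number of iterations of the SGD procedure), lr_all (the learning rate for all parameters), and reg_all (the regularization term for all parameters).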
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
'reg_all': [0.2, 0.4, 0.6]}
# Performing 3-fold gridsearch cross validation
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting data
gs.fit(data)
# Best RMSE score
print(gs.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
0.8951985726464798
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters, as shown above.
Below we are analysing evaluation metrics - RMSE and MAE at each and every split to analyze the impact of each value of hyperparameters
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.head()
 | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_n_epochs | param_lr_all | param_reg_all |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.940377 | 0.945866 | 0.944982 | 0.943742 | 0.002406 | 25 | 0.737076 | 0.739329 | 0.738857 | 0.738420 | 0.000970 | 25 | 0.800517 | 0.047832 | 0.459344 | 0.027451 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.2} | 10 | 0.001 | 0.2 |
1 | 0.945583 | 0.949379 | 0.949643 | 0.948202 | 0.001855 | 26 | 0.742707 | 0.743683 | 0.743896 | 0.743429 | 0.000518 | 26 | 0.782589 | 0.020786 | 0.436198 | 0.006444 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.4} | 10 | 0.001 | 0.4 |
2 | 0.951009 | 0.954499 | 0.954042 | 0.953183 | 0.001549 | 27 | 0.748806 | 0.749643 | 0.749080 | 0.749176 | 0.000348 | 27 | 0.785872 | 0.034149 | 0.425623 | 0.003899 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.6} | 10 | 0.001 | 0.6 |
3 | 0.905806 | 0.909565 | 0.908088 | 0.907820 | 0.001546 | 10 | 0.702849 | 0.703889 | 0.702569 | 0.703103 | 0.000568 | 9 | 0.817078 | 0.030645 | 0.474087 | 0.047315 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.2} | 10 | 0.005 | 0.2 |
4 | 0.913282 | 0.916795 | 0.914698 | 0.914925 | 0.001443 | 15 | 0.710566 | 0.711731 | 0.710063 | 0.710787 | 0.000698 | 15 | 0.792537 | 0.030592 | 0.430651 | 0.011269 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4} | 10 | 0.005 | 0.4 |
Now, we will build the final model by using the tuned values of the hyperparameters, which we obtained from the grid search cross-validation above.
# Building the optimized SVD model using optimal hyperparameter search
svd_algo_optimized = SVD(n_epochs= 30,lr_all= 0.01, reg_all= 0.2)
# Training the algorithm on the trainset
svd_algo_optimized.fit(trainset)
# Predicting ratings for the testset
predictions = svd_algo_optimized.test(testset)
# Computing RMSE
accuracy.rmse(predictions)
RMSE: 0.8952
0.8951788601358034
Let's now predict the rating for the user with userId=4 and movieId=10 with the optimized model.
svd_algo_optimized.predict(4, 10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 3.99 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.989244579160853, details={'was_impossible': False})
Using the optimized SVD model, the estimated rating is 3.99, which is close to the actual rating of 4.
Let's predict the rating for the same userId=4 but for a movie which this user has not interacted with before, i.e., movieId=3:
svd_algo_optimized.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.63 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.6293926107395005, details={'was_impossible': False})
get_recommendations(rating, 4, 5, svd_algo_optimized)
[(1192, 4.992219812811331), (116, 4.961040017697183), (926, 4.957147548937894), (1948, 4.927934951241887), (3310, 4.922485554135631)]
Below, we compare the predicted ratings with the actual ratings for the movies that a user has already watched. This helps us understand how close the predictions are to the ratings the user actually gave.
def predict_already_interacted_ratings(data, user_id, algo):
    # Creating an empty list to store (movieId, actual rating, predicted rating) tuples
    recommendations = []
    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # Extracting the movie ids which user_id has already interacted with
    interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].notnull()].index.tolist()
    # Looping through each movie id which user_id has already interacted with
    for item_id in interacted_movies:
        # Extracting the actual rating
        actual_rating = user_item_interactions_matrix.loc[user_id, item_id]
        # Predicting the rating for this already-interacted movie id
        predicted_rating = algo.predict(user_id, item_id).est
        # Appending the movie id with its actual and predicted ratings
        recommendations.append((item_id, actual_rating, predicted_rating))
    # Sorting by actual rating in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    # Returning the actual and predicted ratings for every movie this user has interacted with
    return pd.DataFrame(recommendations, columns=['movieId', 'actual_rating', 'predicted_rating'])
Here, we compare the ratings predicted by the similarity-based recommendation system against the actual ratings for userId=7.
predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, similarity_algo_optimized_item)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);
Write your Answer here: In the distribution plot, the predicted ratings are concentrated between about 3 and 4, whereas the actual ratings take discrete values from 1 to 5. The predicted values lie only between 2.5 and 4, while the distribution of actual ratings peaks at 3.
Below, we compare the ratings predicted by the matrix factorization based recommendation system against the actual ratings for userId=7.
predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, svd_algo_optimized)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))
# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
RMSE is not the only metric we can use here. We can also examine two fundamental measures, precision and recall. We also add a parameter k which is helpful in understanding problems with multiple rating outputs.
Precision@k - It is the fraction of recommended items in the top k predictions that are relevant. The value of k is the number of recommendations to be provided to the user, and a different number of recommendations can be chosen for each user.
Recall@k - It is the fraction of relevant items that are recommended to the user in the top k predictions.
Recall - It is the fraction of actually relevant items that are recommended to the user, i.e. if 6 out of 10 relevant movies are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.
Precision - It is the fraction of recommended items that are actually relevant, i.e. if 6 out of 10 recommended items are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.
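To make these definitions concrete, here is a small sketch that computes precision@k and recall@k for a single user, treating an item as relevant when its actual rating is at least the threshold and as recommended when its estimated rating is at least the threshold. The helper name `precision_recall_single_user` and the toy ratings are assumptions made for illustration only.

```python
def precision_recall_single_user(user_ratings, k=10, threshold=3.5):
    """user_ratings: list of (estimated, actual) rating pairs for one user."""
    # Rank the user's items by estimated rating, highest first
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # Relevant items across all of the user's rated items
    n_rel = sum(actual >= threshold for _, actual in ranked)
    # Recommended items within the top k
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    # Items in the top k that are both recommended and relevant
    n_rel_and_rec_k = sum(
        est >= threshold and actual >= threshold for est, actual in top_k
    )
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall


# Hypothetical (estimated, actual) ratings for one user
toy = [(4.8, 5.0), (4.5, 3.0), (4.1, 4.0), (3.9, 4.5), (3.2, 4.0), (2.5, 2.0)]
precision, recall = precision_recall_single_user(toy, k=3)
print(precision, recall)
```

With k=3, two of the three recommended items are relevant, so precision is 2/3; the user has four relevant items in total, so recall is 2/4.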
See the Precision and Recall @ k section of your notebook and follow the instructions to compute various precision/recall values at various values of k.
To know more about precision recall in Recommendation systems refer to these links :
https://surprise.readthedocs.io/en/stable/FAQ.html
https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
# Function can be found on surprise documentation FAQs
from collections import defaultdict
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls
# A basic cross-validation iterator.
kf = KFold(n_splits=5)
# Make list of k values
K = [5, 10]
# Remove _______ and complete the code
# Make list of models
models = [algo_knn_user, similarity_algo_optimized_user, algo_knn_item, similarity_algo_optimized_item, algo_svd, svd_algo_optimized]
for k in K:
    for model in models:
        print('> k={}, model={}'.format(k, model.__class__.__name__))
        p = []
        r = []
        for trainset, testset in kf.split(data):
            model.fit(trainset)
            predictions = model.test(testset, verbose=False)
            precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=3.5)
            # Precision and recall can then be averaged over all users
            p.append(sum(prec for prec in precisions.values()) / len(precisions))
            r.append(sum(rec for rec in recalls.values()) / len(recalls))
        print('-----> Precision: ', round(sum(p) / len(p), 3))
        print('-----> Recall: ', round(sum(r) / len(r), 3))
> k=5, model=KNNBasic
-----> Precision: 0.769
-----> Recall: 0.413
> k=5, model=KNNBasic
-----> Precision: 0.773
-----> Recall: 0.416
> k=5, model=KNNBasic
-----> Precision: 0.605
-----> Recall: 0.326
> k=5, model=KNNBasic
-----> Precision: 0.683
-----> Recall: 0.355
> k=5, model=SVD
-----> Precision: 0.754
-----> Recall: 0.386
> k=5, model=SVD
-----> Precision: 0.748
-----> Recall: 0.384
> k=10, model=KNNBasic
-----> Precision: 0.75
-----> Recall: 0.545
> k=10, model=KNNBasic
-----> Precision: 0.752
-----> Recall: 0.559
> k=10, model=KNNBasic
-----> Precision: 0.594
-----> Recall: 0.471
> k=10, model=KNNBasic
-----> Precision: 0.664
-----> Recall: 0.508
> k=10, model=SVD
-----> Precision: 0.739
-----> Recall: 0.522
> k=10, model=SVD
-----> Precision: 0.725
-----> Recall: 0.524
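Since the output prints only the class name, the four KNNBasic results and the two SVD results are hard to tell apart. One possible way to make the comparison easier is to pair each result with the variable name at the same position in the `models` list. The sketch below does this in plain Python with the numbers copied from the output above; the names `model_names`, `results`, and `labelled` are introduced here for illustration.

```python
# Variable names in the same order as the `models` list in the notebook
model_names = [
    "algo_knn_user",
    "similarity_algo_optimized_user",
    "algo_knn_item",
    "similarity_algo_optimized_item",
    "algo_svd",
    "svd_algo_optimized",
]

# (precision, recall) pairs copied from the printed output, in loop order
results = {
    5: [(0.769, 0.413), (0.773, 0.416), (0.605, 0.326),
        (0.683, 0.355), (0.754, 0.386), (0.748, 0.384)],
    10: [(0.75, 0.545), (0.752, 0.559), (0.594, 0.471),
         (0.664, 0.508), (0.739, 0.522), (0.725, 0.524)],
}

# Attach a model name to every (precision, recall) pair
labelled = {k: dict(zip(model_names, scores)) for k, scores in results.items()}

for k, per_model in labelled.items():
    print(f"k={k}")
    for name, (precision, recall) in per_model.items():
        print(f"  {name:<32} precision={precision:.3f} recall={recall:.3f}")

# Model with the highest precision at k=10
best_precision_at_10 = max(labelled[10], key=lambda name: labelled[10][name][0])
print(best_precision_at_10)
```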
7.1 Compare the results from the baseline user-user and item-item based models.
7.2 How do these baseline models compare to each other with respect to the tuned user-user and item-item models?
7.3 The matrix factorization model is different from the collaborative filtering models. Briefly describe this difference. Also, compare the RMSE and precision recall for the models.
7.4 Does it improve? Can you offer any reasoning as to why that might be?
Write your Answer here:__
In this case study, we saw three different ways of building recommendation systems:
We also learned the advantages and disadvantages of these recommendation systems and when to use each kind. Once we have built these recommendation systems, we can use A/B testing to measure their effectiveness.
Here is an article explaining how Amazon uses A/B testing to measure the effectiveness of its recommendation systems.