Chapter 8
Fast.ai Fastbook Chapter 8: Collaborative Filtering
This chapter is all about so-called latent factors. Latent factors are underlying concepts in your data that are not explicitly present up front, but can be learned by association.
For example, on Netflix you may have watched lots of movies that are science fiction, full of action, and were made in the 1970s. Netflix may not know these particular properties of the films you have watched, but it will be able to see that other people who watched the same movies you did also tended to watch other movies that are science fiction, full of action, and were made in the 1970s. In other words, to use this approach we don’t necessarily need to know anything about the movies, except who liked to watch them.
The key foundational idea is that of Latent Factors. These factors represent some information about the data that we as humans don’t necessarily understand or know. According to Wikipedia:
In statistics, latent variables (from Latin: present participle of lateo, “lie hidden”[1]) are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or measured.
In this chapter we will try to build a relatively simple movie recommendation model using latent factors. For our data we will be using 100k rows of movie reviews (the MovieLens 100k dataset), which look like this:

Multiple users have reviewed multiple movies, so the data can also be represented as this:

So we have here a bunch of users and a bunch of scores for movies. We know that higher scores correspond to a user liking that movie, but that is all we know. We do not know what the type of the movie is, how old it is, what the user does in his/her spare time, etc. Those are all factors that have contributed to a review score, but are not directly known. But the information is there, so let’s learn that information.
Learning the Latent Factors
So, learning latent factors looks pretty darn similar to using a “standard” deep learning neural network. Step 1 is: initialize random weights. These parameters will be the latent factors that we will be learning. We give a couple of them to each user, and a couple to each movie. That will look something like this:

We then combine each user’s factors with each movie’s factors using the dot product:
(Array1 * Array2).sum()
This gives us a value that we can plug into a loss function to see how good our prediction is compared to the actual score, and then with SGD we keep iterating until we are happy with our loss or we run out of time. That is it.
So let’s get our data into something workable to show an example. We start with a relatively low-level solution requiring a lot of manual work, but we end with pretty much a one-liner using the Fast.ai library.
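As a minimal sketch of that loop in plain PyTorch (with made-up numbers rather than our real dataset, so all names here are just illustrative):
import torch

# hypothetical latent factors for one user and one movie (5 factors each)
user = torch.randn(5, requires_grad=True)
movie = torch.randn(5, requires_grad=True)
rating = torch.tensor(4.0)           # the score the user actually gave

pred = (user * movie).sum()          # dot product = our prediction
loss = (pred - rating) ** 2          # squared-error loss
loss.backward()                      # compute gradients

with torch.no_grad():                # one SGD step: nudge the factors downhill
    for p in (user, movie):
        p -= 0.1 * p.grad
        p.grad.zero_()
Repeat that over all user/movie pairs and the random numbers slowly turn into meaningful latent factors.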
Creating the DataLoaders
With some very basic data manipulation, we end up with the following matrix:

Where we have 944 users and 1665 movies. We can create separate matrices of latent factors using the following code:
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
Here, user_factors is a [944x5] matrix, while movie_factors is a [1665x5] matrix, both filled with random values sampled from the standard normal distribution (mean 0, variance 1).
The dumb thing is that if we now want to combine these tensors, we have to look up rows by ID in two different matrices, and that kind of indexing is not something deep learning models can really do; they only know how to do matrix multiplication. We could use one-hot encoding, but that is insanely memory-inefficient, as it creates big tensors filled with 99% zeros. So we create what is called an embedding to fix this:
jargon: Embedding: Multiplying by a one-hot-encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. This is quite a fancy word for a very simple concept. The thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly) is called the embedding matrix.
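A tiny sketch of that equivalence (just illustrative numbers, not our trained factors): multiplying a one-hot vector by the factor matrix picks out exactly one row, which is the same as indexing directly.
import torch

movie_factors = torch.randn(1665, 5)          # one row of factors per movie

one_hot = torch.zeros(1665)
one_hot[3] = 1.                               # one-hot vector for movie id 3

via_matmul = one_hot @ movie_factors          # [1665] @ [1665, 5] -> [5]
via_index  = movie_factors[3]                 # direct lookup: the same 5 numbers

print(torch.allclose(via_matmul, via_index))  # True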
So let’s implement this all from scratch.
Collaborative filtering from scratch
So writing your own PyTorch Module (a “model”) is as simple as:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
- You need to inherit from `Module`.
- You need to implement a `forward` method.
- The input of the model (`x` in this case) is a tensor of shape `[batchsize x 2]`, where the first column is the user IDs and the second column is the movie IDs.
So the forward method does the following:
- Get all the latent factors for all the users in the batch (`users` is therefore a `[batchsize x 5]` tensor).
- Get all the latent factors for all the movies in the batch (`movies` is therefore a `[batchsize x 5]` tensor).
- Take the dot product of those two tensors and return it. Since we sum over dimension 1 (the factor dimension), the result is a `[64]` tensor: one prediction per review in the batch of 64. If we had summed over dimension 0 instead, we would have collapsed the batch and gotten one number per factor, mixing all the reviews in the batch together, which is not what we want.
So there are a couple of things we can do to improve this simple model (but believe me, it will never get complicated):
- Force predictions to be between 0 and 5 (or actually between 0 and 5.5: the sigmoid we will use never quite reaches its upper bound, so if we capped at 5, a 5 would never be reached).
- Add biases. Some users are just more positive or negative in their reviews than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things.
Let’s update the model with these two additions:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
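The sigmoid_range helper is what squashes the raw output into the (0, 5.5) range; conceptually it is just a scaled and shifted sigmoid, roughly:
def sigmoid_range(x, low, high):
    # squash x into the interval (low, high): sigmoid gives (0, 1), then scale and shift
    return torch.sigmoid(x) * (high - low) + low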
But these additions actually made the model perform worse, a clear sign of overfitting. Yet we can’t really do data augmentation the way we do with images. So what can we do?
Weight decay (L2 regularization)
Weight decay, or L2 regularization, consists of adding the sum of all the weights squared to your loss function. Why do that? Because when we compute the gradients, this term adds a contribution that encourages the weights to be as small as possible.
See it as a way to penalize complexity without reducing the model’s actual capacity by removing parameters or using fewer layers. Real-world data is complex, so models need at least some complexity to function properly.
Adding the raw sum of all weights squared is not the complete answer though, since that might lead to the model just deciding to set all weights to 0 and call it a day. So we multiply this sum by a number between 0 and 1. That multiplier is the actual number that is called the weight decay.
So we end up with:
loss = loss_function() + wd * sum(parameters**2)
But since we use gradient descent, we only ever use this extra term through its derivative. The derivative of `wd * (parameters**2).sum()` with respect to the parameters is `wd * 2 * parameters`, so we can skip the loss term entirely and just do the following:
parameters.grad += wd * 2 * parameters
And you could even just remove the 2 from this function and make the wd hyperparameter twice as big.
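As a minimal sketch of what a single manual training step could look like with weight decay applied through the gradient (the names model, loss_func, x and y here are hypothetical stand-ins for whatever model, loss, and batch you are training with):
wd, lr = 0.1, 5e-3                 # weight decay and learning rate
preds = model(x)
loss = loss_func(preds, y)         # note: no weight decay term added to the loss itself
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p.grad += wd * 2 * p       # same effect as adding wd * (p**2).sum() to the loss
        p -= lr * p.grad           # plain SGD step
        p.grad.zero_()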
Of course this is built into Fast.ai:
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
This model leads to much better results overall.
Interpreting embeddings and biases
So this model can now provide movie recommendations for users given their previous ratings. But we also want to look inside it. What has it learned as latent factors, and can we interpret those? What are the best and worst movies, even when controlling for some variables (which we can’t name, since they are latent, but still)?
So remember we introduced biases for the movies as well as for the users. But what do these mean? Since we have 1665 movies, we have a bias tensor of [1665 x 1]: one number for each movie. Let’s get the lowest biases using the following code:
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]
This code might require some explanation:
- `squeeze()` removes all dimensions of size 1.
- `argsort()` gives the indices that would sort the array, so taking the first 5 entries of its result gives us the indices of the 5 lowest biases.
- Because everything is in the same order, we can grab the titles belonging to these 5 indices as the titles with the lowest biases.
But what does having a low bias mean in this case? Well, looking at the titles of the movies might tell you more:
['Children of the Corn: The Gathering (1996)',
'Showgirls (1995)',
'Lawnmower Man 2: Beyond Cyberspace (1996)',
'Mortal Kombat: Annihilation (1997)',
'Beautician and the Beast, The (1997)']
These are commonly known to be terrible movies, even for users whose latent factors say they should like this kind of movie. So even if someone likes movies similar to these, they still do not like these specific ones.
The same can be done for users, to find the ones who give lower-than-average ratings even for movies the model thinks they should like, but that is less useful in this case.
It is also possible to do a PCA (Principal Component Analysis) to see if we can make any sense of the latent factors the model found, but that is an entirely different topic, so I won’t go through the steps here.
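For completeness, the analogous lookup for users would simply mirror the movie snippet above (assuming the same trained model and DataLoaders):
user_bias = learn.model.user_bias.squeeze()
idxs = user_bias.argsort()[:5]
[dls.classes['user'][i] for i in idxs]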
Using Fast.ai collab
So we basically built the entire model from scratch. Let’s now see what a framework can do for us.
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
And we are done.
Let’s see if we can do some more fun stuff with embeddings. For example, let’s check how close some movies are to each other.
Embedding distance
It’s easy to visualize in an (x, y) plane how close two points are to each other. Using the Pythagorean formula we can calculate the length of the line between them: $\sqrt{(x_1-x_2)^{2}+(y_1-y_2)^{2}}$. The idea is no different in 50 dimensions. Let’s try to find the movie most similar to “Silence of the Lambs”:
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]
- We grab our 50 factors per movie from the model.
- We get the index of “Silence of the Lambs”.
- We calculate the distance using `CosineSimilarity` between every movie and “Silence of the Lambs” (see the formula after this list).
- We use `argsort()` to sort them descending, meaning the most similar movies are at the top.
- We take the second entry, because the first entry is “Silence of the Lambs” itself, of course.
Deep Learning for collaborative filtering
So this dot product model works quite well, and can be used in production settings. But we can also use deep learning to make a model. Why? Because we can.
So we no longer rely on the dot product result; instead we create a sequential neural network where we concatenate our embeddings together:
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
- `user_sz` and `item_sz` are predetermined embedding sizes: a (number of categories, number of latent factors) pair each. These are determined by the fast.ai framework to work well given the current data; see the sketch after this list.
- We build a simple `nn.Sequential` where we feed the concatenated embeddings through a linear layer, a ReLU, and a single output activation.
- Everything else is pretty much the same as in the DotProduct model.
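One way to get those fastai-chosen sizes and train this model is a short sketch like the following, using fastai's get_emb_sz helper (which returns a (number of categories, embedding size) pair for each categorical variable):
embs = get_emb_sz(dls)                     # one (count, size) pair for users and one for titles
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)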
Because it is so similar to a dot product model, fast.ai lets you use a neural net with a simple boolean:
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
The result in this case, however, is pretty much identical.
Questionnaire:
- What problem does Collaborative Filtering solve? A. It’s used for recommendation problems: “If X likes 1 and 2, would X also like 3?”-type questions.
- How does it solve it? A. It solves this by trying to learn factors that represent some information about the data that we as humans don’t necessarily understand or know. These factors ARE present in the data, however. They are called Latent Factors.
- Why might a Collaborative Filtering model fail to be a very useful recommendation system? A. Because it is not immune to bias. If you have a relatively small group which is VERY active and skews the system towards a certain type of recommendation, the model happily tags along.
- What does a cross-tab representation of collaborative filtering data look like?
A. Like this:

- Write the code to create a cross-tab representation of the Movie Lens data? A. So a cross-tab is nothing more than a so-called pivot table, which can easily be generated using Pandas:
import pandas as pd
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
names=['user','movie','rating','timestamp'])
crosstab = ratings.pivot_table(index='user', columns='movie', values='rating')
print(crosstab)
- What is a Latent Factor? Why is it latent?
A. A latent factor is a factor that is present in the data, but not immediately visible. For example, in movie ratings we only have users, movies, and ratings, but there is a lot of information in the combination of the three. If a user consistently gives action movies high reviews, we could say that that specific user likes action movies, even though there isn’t a column called `likesActionMovies`. That’s why they are latent.
- What is a dot product? Calculate a dot product manually using python lists.
A. A dot product is the result of multiplying two vectors element-wise and adding all the values up. With tensors that is `(v1 * v2).sum()`; with plain Python lists you can write `sum(a * b for a, b in zip(v1, v2))`.
- What does `pandas.DataFrame.merge` do? A. The `merge()` method merges two dataframes together on a common ID column.
- What is an embedding matrix? A. An embedding matrix is a trainable parameter matrix used in machine learning models to represent categorical variables as dense vectors. Each row in the embedding matrix corresponds to a unique category, and the columns represent the latent features of these categories. This allows the model to learn useful representations of the categories during training.
- What is the relationship between an embedding and a matrix of one-hot-encoded vectors? A. An embedding can be seen as a computational shortcut for a matrix of one-hot-encoded vectors.
- Why do we need embeddings if we could just use one-hot-encoded vectors? A. Because one-hot-encoded vectors are terribly memory-inefficient. Also, with one-hot-encoded vectors alone it is impossible to learn relationships between categories.
- What does an embedding contain before we start training? A. Just a bunch of random numbers, usually sampled from the standard normal distribution (mean 0, variance 1).
- Create a class and use it A.
class Temp()
def __init__(self, x):
self.x = x
def print():
print(x)
- What does `x[:,0]` return? A. It returns the first column for all rows.
- X
- What is a good loss function to use for MovieLens? Why? A. Since we are dealing with numbers between 0 and 5 here, we can easily use an MSE / RMSE loss function, since it is easy to calculate the distance between a prediction and a target in this case.
- What would happen if we used `CrossEntropy` loss? How would we need to change the model? A. Cross-entropy loss treats the problem as classification, so it would not work with our single numeric output. We would have to change the model to output one activation per possible rating (a probability distribution over the rating classes) instead of a single number between 0 and 5.5.
- What is the use of bias in the dot product model? A. The use of bias is to encode that some users are just more positive or negative in their reviews than others, and some movies are just plain better or worse than others. By including a bias we correct for this.
- What is another name for weight decay? A. Another name for weight decay is L2 regularization.
- Write the equation for weight decay?
A.
loss = loss + wd * (parameters**2).sum()
- Write the equation for the gradient of weight decay?
A.
parameters.grad += wd * 2 * parameters
- Why does reducing weights lead to better generalization? A. Real-life data is complex, so it requires complex models to model it. But you also do not want your models to become so complex that they are basically overfitting. So you want complex models, while also “punishing” too much complexity, which is exactly what weight decay does.
- What does `argsort()` do? A. It returns the indices that would sort the array (it is a NumPy/PyTorch method rather than a Python built-in). It feels a bit counter-intuitive not to just sort the array, but returning indices lets you reorder other arrays in the same way and avoids copying the data. Example: given an array `x = [0, 5, 3, 2, 1, 4]`, calling `x.argsort()` returns `[0, 4, 3, 2, 5, 1]`; retrieving the elements at these indices from `x` gives a sorted array.
- Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not? A. No, since the biases are largely independent of the raw ratings. If you average the movie ratings, you would just find the movies people don’t like at the bottom. But is that because they are terrible movies, or because people didn’t like them for other reasons that the latent factors already capture?
- How do you print the names and details of the layers in the model?
A. By just printing the model: `learn.model`.
- What is the bootstrapping problem in collaborative filtering? A. The bootstrapping problem refers to the problem that a model cannot make predictions for users it has no data about. A new user with zero movie reviews has no preferences according to the model.
- How could you deal with new users? A. Ask some questions whenever a user signs up. What do they like? What preferences do they have?
- How can feedback loops impact collaborative filtering systems? A. Small groups of really active users can skew the system their way.
- When using a neural network for collaborative filtering, why can we have different numbers of latent factors for movies and users? A. Since we concatenate the embeddings instead of taking their dot product, the numbers of factors do not have to match.
- Why is there an `nn.Sequential()` in the `CollabNN` model? A. So we can couple multiple layers together to create possibly more complex models.
- What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model? A. We should use a tabular model, which can take those extra columns as inputs alongside the user and item embeddings.