Low Rank Matrix Factorization
Formulation
Let's assume that our system has $m$ users and $n$ movies. We assign $K$ latent features to each user and movie in the system. We can construct a matrix factorization as follows:

$U \in \mathbb{R}^{m \times K}$ represents the latent feature matrix for all users in our system. $M \in \mathbb{R}^{n \times K}$ represents the latent feature matrix for all movies in our system. The matrix product of $U$ and $M^{\top}$ is the model's predicted rating:

$$\hat{R} = U M^{\top}$$

Let $R$ represent the actual ratings we received from the MovieLens dataset. For every missing value in $R$, we ignore its contribution to the loss function. Then, for every `R[i][j] != nil`, we define the loss function as follows:

$$L = \sum_{i,j \,:\, R_{ij} \neq nil} \left(R_{ij} - \hat{R}_{ij}\right)^{2}$$
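This masked loss can be sketched in a few lines of numpy. The matrices and values below are made-up toy data for illustration; zeros in `R` stand in for nil entries, following the convention used later in this post.

```python
import numpy as np

def loss(R, U, M):
    """Sum of squared errors over observed (nonzero) entries of R only."""
    mask = R != 0                   # True where a rating exists
    diff = (R - U @ M.T) * mask     # zero out contributions of nil entries
    return np.sum(diff ** 2)

# Toy example: 3 users, 4 movies, K = 2 latent features.
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
rng = np.random.default_rng(0)
U = rng.standard_normal((3, 2))     # user latent feature matrix
M = rng.standard_normal((4, 2))     # movie latent feature matrix
print(loss(R, U, M))
```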
The optimization objective here is to minimize the loss function above. Notice that I have omitted bias terms here. I've noticed that biases are in general not necessary for matrix factorization: the simultaneous update of the user and movie latent vectors compensates for and accounts for the needed bias. I may be wrong, since I haven't actually written down the math to prove that this is the case.
Partial Derivatives & Gradients
Iterative Approach
For a given user $i$ with a preference vector of size $K$, iterate from $k = 0$ to $k = K - 1$ and update each element of his/her preference vector using the following gradient calculation:

$$\frac{\partial L}{\partial U_{ik}} = \sum_{j \,:\, R_{ij} \neq nil} -2\left(R_{ij} - \hat{R}_{ij}\right) M_{jk}$$

For a given movie $j$ with a feature vector of size $K$, iterate from $k = 0$ to $k = K - 1$ and update each element of its feature vector using the following gradient calculation:

$$\frac{\partial L}{\partial M_{jk}} = \sum_{i \,:\, R_{ij} \neq nil} -2\left(R_{ij} - \hat{R}_{ij}\right) U_{ik}$$

Remember that

$$\hat{R}_{ij} = \sum_{k=0}^{K-1} U_{ik} M_{jk}$$
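The iterative updates above can be sketched as follows. This is a minimal illustration, not the post's actual implementation (see the `lowrank` module for that); the learning rate and toy data are assumptions, and zeros in `R` mark nil entries.

```python
import numpy as np

def iterative_epoch(R, U, M, lr=0.01):
    """One pass of elementwise gradient descent over observed entries.

    Updates U and M in place using the per-element gradients above
    (no bias terms, no regularization).
    """
    num_users, num_movies = R.shape
    K = U.shape[1]
    for i in range(num_users):
        for j in range(num_movies):
            if R[i, j] == 0:              # skip nil entries
                continue
            err = R[i, j] - U[i] @ M[j]   # R[i][j] minus prediction
            for k in range(K):
                u_ik = U[i, k]            # keep old value for M's update
                U[i, k] += lr * 2 * err * M[j, k]
                M[j, k] += lr * 2 * err * u_ik
    return U, M
```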
Vectorized Approach
Recall that the output of our low-rank matrices model is $\hat{R} = U M^{\top}$, and let's find the gradient of $L$ with respect to $\hat{R}$ first. Differentiating the square term brings down a factor of $-2$:

$$\frac{\partial L}{\partial \hat{R}_{ij}} = -2\left(R_{ij} - \hat{R}_{ij}\right)$$

Now let's figure out the gradient of $\hat{R}$ with respect to $U$ and $M$:

$$\frac{\partial \hat{R}_{ij}}{\partial U_{ik}} = M_{jk} \qquad \frac{\partial \hat{R}_{ij}}{\partial M_{jk}} = U_{ik}$$

Using the chain rule, we can then derive the following results:

$$\frac{\partial L}{\partial U} = \frac{\partial L}{\partial \hat{R}}\, M \qquad \frac{\partial L}{\partial M} = \left(\frac{\partial L}{\partial \hat{R}}\right)^{\top} U$$
Here is an example implemented using numpy. We set the gradient of the model's prediction output to zero wherever that particular entry is missing in the training rating matrix, which simply ignores nil entries' contribution to the loss function. We use zero to represent nil entries in the training $R$ matrix.
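A minimal sketch of such a vectorized step, assuming a made-up learning rate; the real implementation lives in the `lowrank` module:

```python
import numpy as np

def gradient_step(R, U, M, lr=0.005):
    """One vectorized gradient-descent step on U and M.

    Entries where R == 0 are treated as nil: their gradient
    contribution is zeroed out, which removes them from the loss.
    """
    mask = R != 0
    grad_out = 2 * (U @ M.T - R) * mask   # dL/d(prediction), zeroed at nils
    grad_U = grad_out @ M                 # chain rule: dL/dU
    grad_M = grad_out.T @ U               # chain rule: dL/dM
    return U - lr * grad_U, M - lr * grad_M
```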
Caution
In general the vectorized approach is much faster than the iterative approach, but there is a huge memory constraint. When you run 260,000 users on 45,000 movies, the matrix has 11,700,000,000 entries; assuming each entry is a float64, that means 93,600,000,000 bytes in memory for just one single matrix. I am pretty sure you don't have a machine with 100+ GB of RAM. Thus, depending on dataset size, it is sometimes better to write an iterative approach in a statically typed, performant language (C, C++, Go, or Java/Scala).
Implementation Details
Please look at the lowrank module inside this folder to find out more about the implementation details.
Gradient Check
Let's confirm that gradients are computed correctly first.
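One way to do this is a central-difference numerical gradient, compared against the analytic gradient derived above. This is an illustrative sketch on tiny made-up matrices (numerical differentiation is far too slow for anything larger):

```python
import numpy as np

def masked_loss(R, U, M):
    """Sum of squared errors over observed (nonzero) entries only."""
    mask = R != 0
    return np.sum(((R - U @ M.T) * mask) ** 2)

def numerical_grad_U(R, U, M, eps=1e-5):
    """Central-difference estimate of dL/dU, one entry at a time."""
    grad = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        U_plus, U_minus = U.copy(), U.copy()
        U_plus[idx] += eps
        U_minus[idx] -= eps
        grad[idx] = (masked_loss(R, U_plus, M) -
                     masked_loss(R, U_minus, M)) / (2 * eps)
    return grad

# Compare against the analytic gradient dL/dU = (2 * (U M^T - R) * mask) M.
rng = np.random.default_rng(1)
R = np.array([[5.0, 0.0], [3.0, 1.0]])
U = rng.standard_normal((2, 2))
M = rng.standard_normal((2, 2))
analytic = (2 * (U @ M.T - R) * (R != 0)) @ M
print(np.max(np.abs(analytic - numerical_grad_U(R, U, M))))
```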
Real Data

Root Mean Squared Error
The 100,000-rating dataset we use is for debugging purposes. Because the number of ratings in this dataset is far too small, the RMSE is relatively high.
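For reference, RMSE here can be computed over the observed entries only, using the same zero-means-nil convention as the rest of this post; this is a sketch, not the `lowrank` module's actual evaluation code:

```python
import numpy as np

def rmse(R, U, M):
    """Root mean squared error over observed (nonzero) entries of R."""
    mask = R != 0
    sq_err = ((R - U @ M.T) * mask) ** 2
    return np.sqrt(sq_err.sum() / mask.sum())
```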
