Let's assume that our system has I users and J movies. We assign K latent features to each user and each movie in the system. We can construct a matrix factorization as follows:
X represents the latent feature matrix for all users in our system, and Θ represents the latent feature matrix for all movies; each row is one user's (or movie's) length-K latent vector. The matrix product of X and Θᵀ is the model-predicted rating matrix.
$$X\Theta^{T} = \hat{R}$$
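As a quick sanity check on the shapes involved, here is a minimal sketch with numpy; the sizes and variable names are illustrative stand-ins for I, J, and K, not values from the actual MovieLens setup:

```python
import numpy as np

# Illustrative sizes, standing in for I users, J movies, K latent features
n_users, n_movies, n_features = 4, 6, 3

X = np.random.rand(n_users, n_features)       # user latent feature matrix
Theta = np.random.rand(n_movies, n_features)  # movie latent feature matrix

R_hat = X.dot(Theta.T)  # predicted rating matrix, one entry per (user, movie)
print(R_hat.shape)      # (4, 6)
```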
Let R represent the actual ratings we received from the MovieLens dataset. For every missing value in R, we ignore its contribution to the loss function. Then, for every R[i][j] != nil, we define the loss function as follows:
$$L(X, \Theta) = \frac{1}{2}\sum_{i,j}\left(X\Theta^{T} - R\right)_{i,j}^{2} + \frac{\lambda}{2}\sum_{i,k}x_{i,k}^{2} + \frac{\lambda}{2}\sum_{j,k}\theta_{j,k}^{2}$$
The optimization objective is to minimize the loss function above. Notice that I have omitted bias terms here; I've noticed that biases are in general not necessary for matrix factorization, because the simultaneous update of user and movie latent vectors will compensate and account for the needed bias. I may be wrong, since I haven't actually written down the math to prove that is the case.
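To make the handling of missing entries concrete, here is one way this loss can be computed with numpy. This is a sketch, not the actual implementation: the name `masked_loss` and the parameter `lam` are illustrative, and zeros mark missing ratings, matching the convention used later in this notebook.

```python
import numpy as np

def masked_loss(X, Theta, R, lam):
    """Regularized squared error over the observed (non-zero) entries of R."""
    R_hat = X.dot(Theta.T)
    mask = R != 0  # zeros mark missing ratings
    squared_error = 0.5 * np.sum((R_hat[mask] - R[mask]) ** 2)
    regularization = (lam / 2.0) * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return squared_error + regularization
```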
Partial Derivatives & Gradients
Iterative Approach
For a given user i with a preference vector of size K, iterate from k = 0 to k = K − 1 and update each element of his/her preference vector using the following gradient calculation:

$$\frac{\partial L}{\partial x_{i,k}} = \sum_{j=1}^{J}\left(\hat{R}_{i,j} - R_{i,j}\right)\theta_{j,k} + \lambda x_{i,k}$$
For a given movie j with a feature vector of size K, iterate from k = 0 to k = K − 1 and update each element of its feature vector using the following gradient calculation:

$$\frac{\partial L}{\partial \theta_{j,k}} = \sum_{i=1}^{I}\left(\hat{R}_{i,j} - R_{i,j}\right)x_{i,k} + \lambda \theta_{j,k}$$
Remember that

$$\hat{R}_{i,j} = x_i \cdot \theta_j$$
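The per-element updates described above can be sketched as plain gradient-descent loops. This is a sketch, not the lowrank module's implementation; `lam` and `lr` are illustrative names for the regularization strength and learning rate.

```python
import numpy as np

def iterative_step(X, Theta, R, lam, lr):
    """One elementwise gradient-descent pass over observed (non-zero) ratings.

    Updates happen in place as we go, so later elements see earlier updates
    (a coordinate-descent flavor of the update rule).
    """
    I, K = X.shape
    J, _ = Theta.shape
    for i in range(I):
        for k in range(K):
            grad = lam * X[i, k]
            for j in range(J):
                if R[i, j] != 0:  # skip missing entries
                    grad += (X[i].dot(Theta[j]) - R[i, j]) * Theta[j, k]
            X[i, k] -= lr * grad
    for j in range(J):
        for k in range(K):
            grad = lam * Theta[j, k]
            for i in range(I):
                if R[i, j] != 0:
                    grad += (X[i].dot(Theta[j]) - R[i, j]) * X[i, k]
            Theta[j, k] -= lr * grad
    return X, Theta
```

Running repeated steps on a small rating matrix should steadily reduce the reconstruction error on the observed entries.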
Vectorized Approach
Recall that the output of our low-rank model is R̂ = XΘᵀ, and that the gradient of L with respect to R̂ is simply R̂ − R (the 1/2 term cancels with the square term). Using the chain rule, we can then derive the following results:

$$\frac{\partial L}{\partial X} = \left(\hat{R} - R\right)\Theta + \lambda X$$

$$\frac{\partial L}{\partial \Theta} = \left(\hat{R} - R\right)^{T} X + \lambda \Theta$$
An example implemented using numpy: we set the gradient of the model prediction to zero wherever that entry is missing from the training rating matrix, which is simply to say we ignore nil entries' contribution to the loss function. We use zero to represent nil entries in the training matrix R.
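A sketch of such an implementation is below; the name `masked_gradients` is illustrative, and the real version lives in the lowrank module:

```python
import numpy as np

def masked_gradients(X, Theta, R, lam):
    """Vectorized gradients of the masked loss with respect to X and Theta."""
    R_hat = X.dot(Theta.T)
    delta = R_hat - R
    delta[R == 0] = 0  # ignore nil (zero) entries' contribution to the loss
    grad_X = delta.dot(Theta) + lam * X
    grad_Theta = delta.T.dot(X) + lam * Theta
    return grad_X, grad_Theta
```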
In general, the vectorized approach is much faster than the iterative approach, but it comes with a huge memory constraint. When you run 260,000 users on 45,000 movies, the matrix has 11,700,000,000 entries; assuming each entry is a float64 (8 bytes), a single matrix takes 93,600,000,000 bytes (about 94 GB) of memory. Most machines don't have 100+ GB of RAM. Thus, depending on dataset size, it is sometimes better to write an iterative approach in a statically typed, performant language (C, C++, Go, or Java/Scala).
Implementation Details
Please look at the lowrank module inside this folder to find out more about the implementation details.
Gradient Check
Let's confirm that gradients are computed correctly first.
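One standard way to do this is to compare the analytic gradient against a centered finite-difference approximation. A generic helper is sketched here; the lowrank module performs its own check internally, and this name and signature are illustrative:

```python
import numpy as np

def numerical_gradient(f, A, eps=1e-6):
    """Centered finite-difference gradient of scalar function f at matrix A.

    f takes no arguments and reads A; each entry is perturbed in place
    and then restored.
    """
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        plus = f()
        A[idx] = old - eps
        minus = f()
        A[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad
```

If the analytic gradient is implemented correctly, its difference from the numerical one should be tiny (roughly on the order of eps squared).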
```python
import lowrank
import numpy as np

np.random.seed(0)
rand_training_mat = np.random.rand(5, 5)

# Randomly remove certain values to make the matrix sparse
rand_training_mat[rand_training_mat < 0.50] = 0

# Pick out some set of values from the training set to be the test set,
# then remove those values from the training set
rand_test_mat = np.copy(rand_training_mat)
rand_test_mat[rand_training_mat < 0.90] = 0
rand_training_mat[rand_test_mat != 0] = 0

print 'Randomly initialized training sparse matrix'
print rand_training_mat
print 'Randomly initialized test sparse matrix'
print rand_test_mat

factorizer = lowrank.Factorizer(rand_training_mat, rand_test_mat, feature_dim=3)
```
CSV data are loaded with 44229 training samples and 5065 test samples from 79 users on 9125 movies
Factorizer is instantiated with U: (79, 10) and M: (9125, 10)
```python
import matplotlib.pyplot as plt

benchmarks = factorizer.train(steps=400, learning_rate=1e-4)
steps = [bm[0] for bm in benchmarks]
losses = [bm[1] for bm in benchmarks]
rmses = [bm[2] for bm in benchmarks]
plt.plot(steps, losses, 'o-')
plt.show()
```
☑ training: |████████████████████| 100% - current cost: 16262.495841419233
Root Mean Squared Error
The 100,000-rating dataset we use here is for debugging purposes. The number of ratings in this dataset is far too small, so the RMSE is relatively high.
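For reference, an RMSE over only the held-out (non-zero) test entries can be computed as follows. This is a sketch with an illustrative name; the lowrank module reports its own RMSE during training.

```python
import numpy as np

def masked_rmse(X, Theta, R_test):
    """Root mean squared error over observed entries of the test matrix."""
    R_hat = X.dot(Theta.T)
    mask = R_test != 0  # zeros mark missing entries
    return np.sqrt(np.mean((R_hat[mask] - R_test[mask]) ** 2))
```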
```python
plt.plot(steps, rmses, 'o-')
```
(Plot: RMSE versus training steps.)