Low Rank Matrix Factorization

Formulation

Let's assume that our system has $I_{user}$ users and $J_{movie}$ movies. We assign $K_{latent}$ latent features to each user and each movie in the system. We can construct a matrix factorization as follows:

$$
\begin{vmatrix} x_{0,0} & x_{0,1} & x_{0,2} & \dots & x_{0,K} \\ x_{1,0} & \dots & \dots & \dots & \dots \\ x_{2,0} & \dots & \dots & \dots & \dots \\ \dots & \dots & \dots & \dots & \dots \\ x_{I,0} & \dots & \dots & \dots & x_{I,K} \end{vmatrix}
\begin{vmatrix} \theta_{0,0} & \theta_{0,1} & \theta_{0,2} & \dots & \theta_{0,K} \\ \theta_{1,0} & \dots & \dots & \dots & \dots \\ \theta_{2,0} & \dots & \dots & \dots & \dots \\ \dots & \dots & \dots & \dots & \dots \\ \theta_{J,0} & \dots & \dots & \dots & \theta_{J,K} \end{vmatrix}^{T}
=
\begin{vmatrix} r_{0,0} & r_{0,1} & r_{0,2} & \dots & r_{0,J} \\ r_{1,0} & \dots & \dots & \dots & \dots \\ r_{2,0} & \dots & \dots & \dots & \dots \\ \dots & \dots & \dots & \dots & \dots \\ r_{I,0} & \dots & \dots & \dots & r_{I,J} \end{vmatrix}
$$

$X$ represents the latent feature matrix for all users in our system. $\Theta$ represents the latent feature matrix for all movies in our system. The matrix product of $X$ and $\Theta^{T}$ is the model's predicted rating matrix.

$$X\Theta^{T} = \hat{R}$$
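For concreteness, here is how the factorization might look in NumPy; the matrix sizes below are made up purely for illustration:

```python
import numpy as np

I, J, K = 100, 50, 10          # hypothetical counts: users, movies, latent features

X = np.random.randn(I, K)      # user latent feature matrix, one row per user
Theta = np.random.randn(J, K)  # movie latent feature matrix, one row per movie

R_hat = X @ Theta.T            # predicted rating matrix, shape (I, J)
print(R_hat.shape)             # (100, 50)
```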

Let $R$ represent the actual ratings we received from the MovieLens dataset. Every missing value in $R$ is ignored, i.e. it contributes nothing to the loss function. Then, for every `R[i][j] != nil`, we define the loss function as follows:

$$L_{X, \Theta} = \frac{1}{2}\sum_{i,j} \left(X\Theta^{T} - R\right)_{i,j}^{2} + \frac{\lambda}{2}\sum_{i,k} x_{i,k}^{2} + \frac{\lambda}{2}\sum_{j,k} \theta_{j,k}^{2}$$
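As a concrete sketch (not the exact code in the `lowrank` module), the masked, regularized loss could be computed in NumPy as follows, using the zero-means-missing convention described later:

```python
import numpy as np

def loss(X, Theta, R, lam):
    """Squared error over observed ratings plus L2 regularization.

    Zeros in R are treated as missing entries and contribute nothing.
    """
    mask = (R != 0).astype(float)        # 1 where a rating exists, 0 otherwise
    diff = (X @ Theta.T - R) * mask      # errors only on observed entries
    return (0.5 * np.sum(diff ** 2)
            + 0.5 * lam * np.sum(X ** 2)
            + 0.5 * lam * np.sum(Theta ** 2))
```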

The optimization objective here is to minimize the loss function above. Notice that I have omitted bias terms; in my experience biases are generally not necessary for matrix factorization, because the simultaneous update of user and movie latent vectors compensates for the missing bias. I may be wrong, since I haven't actually written down the math to prove that this is the case.

Partial Derivatives & Gradients

Iterative Approach

For a given user $i$ with a preference vector of size $K$, iterate from $k = 0$ to $k = K$ and update each element of his/her preference vector using the following gradient calculation:

$$\frac{\partial L}{\partial x_{i,k}} = \sum_{j}^{J} \left(\hat{R}_{i,j} - R_{i,j}\right)\cdot \theta_{j,k} + \lambda x_{i,k}$$

For a given movie $j$ with a feature vector of size $K$, iterate from $k = 0$ to $k = K$ and update each element of its feature vector using the following gradient calculation:

$$\frac{\partial L}{\partial \theta_{j,k}} = \sum_{i}^{I} \left(\hat{R}_{i,j} - R_{i,j}\right)\cdot x_{i,k} + \lambda \theta_{j,k}$$

Remember that

$$\hat{R}_{i,j} = \vec{x}_{i} \cdot \vec{\theta}_{j}$$
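Putting the two gradients together, a minimal sketch of one iterative update pass might look like the following; the learning rate, regularization strength, and the zero-means-missing convention are illustrative assumptions:

```python
import numpy as np

def iterative_update(X, Theta, R, lam=0.05, lr=0.01):
    """One full pass of gradient descent, updating each latent feature in turn."""
    I, K = X.shape
    J, _ = Theta.shape
    for i in range(I):
        for k in range(K):
            grad = lam * X[i, k]
            for j in range(J):
                if R[i, j] != 0:                       # skip missing ratings
                    err = X[i].dot(Theta[j]) - R[i, j]  # R_hat[i, j] - R[i, j]
                    grad += err * Theta[j, k]
            X[i, k] -= lr * grad
    for j in range(J):
        for k in range(K):
            grad = lam * Theta[j, k]
            for i in range(I):
                if R[i, j] != 0:
                    err = X[i].dot(Theta[j]) - R[i, j]
                    grad += err * X[i, k]
            Theta[j, k] -= lr * grad
    return X, Theta
```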

Vectorized Approach

Recall that the output of our low-rank model is $\hat{R}$, so let's find the gradient of $L$ with respect to $\hat{R}$ first. The $\frac{1}{2}$ factor cancels against the exponent when the squared term is differentiated.

$$\frac{\partial L}{\partial \hat{R}} = \hat{R} - R$$
$$\hat{R} = X\Theta^{T}$$

Now let's figure out the gradient of $\hat{R}$ with respect to $X$ and $\Theta$:

$$\frac{\partial \hat{R}}{\partial X} = \Theta^{T}$$
$$\frac{\partial \hat{R}}{\partial \Theta} = X$$

Using the chain rule, we can then derive the following results. Written out as matrices, and adding back the regularization terms, they become:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial \hat{R}}\frac{\partial \hat{R}}{\partial X} = \left(\hat{R} - R\right)\Theta + \lambda X$$
$$\frac{\partial L}{\partial \Theta} = \frac{\partial L}{\partial \hat{R}}\frac{\partial \hat{R}}{\partial \Theta} = \left(\hat{R} - R\right)^{T} X + \lambda \Theta$$

Below is an example implemented using NumPy. We set the gradient of the model's prediction to zero wherever that entry is missing in the training rating matrix, which simply means we ignore nil entries' contribution to the loss function. We use zero to represent nil entries in the training $R$ matrix.
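A sketch of what that might look like; the hyperparameters and function names here are illustrative, not the exact code in the `lowrank` module:

```python
import numpy as np

def gradients(X, Theta, R, lam):
    """Vectorized gradients of the loss with respect to X and Theta.

    Zeros in R mark missing ratings; their error is zeroed out so they
    contribute nothing to the gradients.
    """
    R_hat = X @ Theta.T
    diff = R_hat - R
    diff[R == 0] = 0                       # ignore nil entries
    grad_X = diff @ Theta + lam * X        # shape (I, K)
    grad_Theta = diff.T @ X + lam * Theta  # shape (J, K)
    return grad_X, grad_Theta

def train(X, Theta, R, lam=0.05, lr=0.001, steps=100):
    """Plain batch gradient descent on the two factor matrices."""
    for _ in range(steps):
        grad_X, grad_Theta = gradients(X, Theta, R, lam)
        X -= lr * grad_X
        Theta -= lr * grad_Theta
    return X, Theta
```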

Caution

In general the vectorized approach is much faster than the iterative approach, but it comes with a huge memory constraint. When you run 260,000 users on 45,000 movies, the rating matrix has 11,700,000,000 entries; assuming each entry is a float64, that is 93,600,000,000 bytes (roughly 94 GB) of memory for just one single matrix. I am pretty sure you don't have a machine with 100+ GB of RAM. Thus, depending on dataset size, it is sometimes better to write an iterative approach in a statically typed, performant language (C, C++, Go, or Java/Scala).

Implementation Details

Please look at the `lowrank` module inside this folder to find out more about the implementation details.

Gradient Check

Let's confirm that gradients are computed correctly first.
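One way to confirm this is to compare the analytical gradients against centered finite differences of the loss. A sketch, reusing the hypothetical `loss` and `gradients` helpers from the earlier snippets:

```python
import numpy as np

def gradient_check(X, Theta, R, lam=0.05, eps=1e-5):
    """Compare the analytical gradient of X against a centered finite difference."""
    grad_X, _ = gradients(X, Theta, R, lam)
    num_grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        X[idx] += eps
        plus = loss(X, Theta, R, lam)
        X[idx] -= 2 * eps
        minus = loss(X, Theta, R, lam)
        X[idx] += eps                      # restore the original value
        num_grad[idx] = (plus - minus) / (2 * eps)
    rel_err = np.linalg.norm(grad_X - num_grad) / np.linalg.norm(grad_X + num_grad)
    return rel_err                         # should be tiny, e.g. well below 1e-6
```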

Real Data

(figure: gradient check results on real data)

Root Mean Squared Error

The 100,000-rating dataset we use here is for debugging purposes. The number of ratings in this dataset is far too small, so the RMSE is relatively high.
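For reference, RMSE over only the observed (non-zero) entries can be computed along these lines; this is a sketch, not necessarily how the `lowrank` module computes it:

```python
import numpy as np

def rmse(X, Theta, R):
    """Root mean squared error computed only over the observed (non-zero) ratings."""
    mask = R != 0
    diff = (X @ Theta.T - R)[mask]
    return np.sqrt(np.mean(diff ** 2))
```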

(figure: RMSE on the 100,000-rating dataset)
