Here I will derive the mathematical basics of why spectral clustering works. I will break them into four parts. The first three parts lay the required groundwork for the mathematics behind spectral clustering. The final part pieces everything together and shows why spectral clustering works as intended.
Part 1 - Vector Derivative of a Matrix
Given a matrix A and a vector x, how do we compute the following?
$$\frac{\partial}{\partial x} x^T A x$$
For example:
$$x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$$

$$A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix}$$
Multiply them out and get the following expression:
$$x^T A x = a_{00} x_0^2 + a_{01} x_0 x_1 + a_{10} x_0 x_1 + a_{11} x_1^2$$
Think of taking the derivative of $x^T A x$ with respect to a vector as an element-wise operation:

$$\frac{\partial}{\partial x_0} x^T A x = 2a_{00}x_0 + a_{01}x_1 + a_{10}x_1$$

$$\frac{\partial}{\partial x_1} x^T A x = a_{01}x_0 + a_{10}x_0 + 2a_{11}x_1$$

Stacking the two partial derivatives back into a vector gives:

$$\frac{\partial}{\partial x} x^T A x = (A + A^T)\,x$$

When A is symmetric, this simplifies to $2Ax$.
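To sanity check this identity, here is a minimal numpy sketch (the values of A and x are made up) that compares the analytic gradient $(A + A^T)x$ against a central-difference numerical gradient:

```python
import numpy as np

# made-up example values
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.5, -1.5])

# analytic gradient of x^T A x
analytic = (A + A.T) @ x

# numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    dx = np.zeros_like(x)
    dx[i] = eps
    numeric[i] = ((x + dx) @ A @ (x + dx) - (x - dx) @ A @ (x - dx)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True
```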
Part 2 - Lagrange Multiplier

The Lagrange multiplier is frequently used in classical mechanics to solve function optimization under constraints. Lagrangian mechanics is also used in non-relativistic quantum mechanics and particle physics; however, that would require knowledge of path integrals. In this section, I am sticking to the plain old Lagrange multiplier to solve a simple constrained problem as an example.
Suppose we want to minimize or maximize a function f(x, y, z) subject to a constraint function g(x, y, z) = k. The easiest example one can think of: you want to build a fence around your house, but the constraint is that you have a limited amount of material from Home Depot. You want to maximize the area enclosed by your fence. You can then use a Lagrange multiplier to perform the calculation and find the optimal solution.
Formulation
We define Lagrangian as follows:
$$L = f(x, y, z) - \lambda\left(g(x, y, z) - k\right)$$
The objective is to find the stationary points of the Lagrangian by setting its gradient to zero:
$$\nabla_{x,y,z,\lambda}\, L(x, y, z, \lambda) = 0$$
We will have four equations and four unknowns.
Example
Let's use a simple 2 dimensional example with only two variables, x and y:
$$f(x, y) = 6x^2 + 12y^2$$

$$g(x, y) = x + y = 90$$
Then the Lagrangian for this example is:
$$L = 6x^2 + 12y^2 - \lambda(x + y - 90)$$
Compute the gradient of our Lagrangian and set it equal to zero:
$$\nabla_{x,y,\lambda}\, L = \begin{bmatrix} 12x - \lambda \\ 24y - \lambda \\ 90 - x - y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$
Then the solution is clear:
$$\begin{bmatrix} x \\ y \\ \lambda \end{bmatrix} = \begin{bmatrix} 60 \\ 30 \\ 720 \end{bmatrix}$$
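If you want to double check this without grinding through the algebra, here is a small sketch using sympy (assuming sympy is available) that solves the same gradient system symbolically:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
L = 6 * x**2 + 12 * y**2 - lam * (x + y - 90)

# gradient of the Lagrangian with respect to x, y, and lambda
gradient = [sp.diff(L, var) for var in (x, y, lam)]

# solve the system of three equations set to zero
solution = sp.solve(gradient, [x, y, lam], dict=True)
print(solution)  # [{lam: 720, x: 60, y: 30}]
```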
Part 3 - Minimize xTAx
The objective here is to combine what we know from Part 1 and Part 2 to achieve the following.
$$\arg\min_{x} \; x^T A x$$
We want to minimize the expression $x^T A x$ under the following constraints.
x is normalized: $x^T x = 1$
A is symmetric: $A = A^T$
A is positive semidefinite: $x^T A x \geq 0$
Being positive semidefinite is an important quality, because if a matrix is positive definite or positive semidefinite, the quadratic form is convex, so the vector at which the derivative of the expression is zero has to be the solution to the minimization. Now that we have our constraints, we are ready to use a Lagrange multiplier to minimize this expression.
$$L = x^T A x - \lambda\left(x^T x - 1\right)$$
We will make an important assumption here, that A is a symmetric matrix. You will see that this is indeed the case later on.
$$\frac{\partial L}{\partial x} = 2Ax - 2\lambda x = 0$$

$$\frac{\partial L}{\partial \lambda} = 1 - x^T x = 0$$
Now solve for x. I will begin to use the vector notation here in case we forget that x is a vector, and it has always been a vector. I omitted the vector symbol to type less LaTeX code on my end but I must include it here to illustrate a point.
$$A\vec{x} = \lambda\vec{x}$$
The constraint equation gave us that
$$\vec{x}^T\vec{x} = 1$$
So what does this mean? It means that if you want to minimize the expression $x^T A x$, $\vec{x}$ must be an eigenvector of A! Substituting $A\vec{x} = \lambda\vec{x}$ back into the objective gives $\vec{x}^T A \vec{x} = \lambda\,\vec{x}^T\vec{x} = \lambda$, the corresponding eigenvalue. Here are a couple of important properties:
A is positive definite if all of its eigenvalues are positive.
A is positive semidefinite if all of its eigenvalues are either positive or zero.
All of the eigenvectors here satisfy our constraint: they are normalized to have a norm equal to 1.
The eigenvector corresponding to the smallest eigenvalue will give you the smallest possible value of $x^T A x$, namely that smallest eigenvalue.
Conversely, the eigenvector corresponding to the biggest eigenvalue will give you the maximum of $x^T A x$ under the same unit-norm constraint.
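Here is a quick numerical check of the minimization claim with numpy, using a small made-up symmetric matrix; note that `np.linalg.eigh` returns eigenvalues in ascending order:

```python
import numpy as np

# a made-up symmetric positive definite matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)  # eigenvalues sorted ascending
v_min = eigenvectors[:, 0]                     # unit-norm eigenvector of the smallest eigenvalue

# x^T A x evaluated at that eigenvector equals the smallest eigenvalue
print(v_min @ A @ v_min, eigenvalues[0])

# no random unit vector should produce a smaller value
xs = np.random.randn(1000, 2)
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
values = np.einsum('ni,ij,nj->n', xs, A, xs)
print(values.min() >= eigenvalues[0] - 1e-9)  # True
```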
Part 4 - Similarity Graph
Adjacency Matrix
Suppose we are given a set of four vertices or nodes V = {v1, v2, v3, v4} and we want to construct an adjacency matrix as follows:
$$W_V = \begin{bmatrix} 5 & 4 & 4 & 0 \\ 4 & 5 & 4 & 0 \\ 4 & 4 & 5 & 1 \\ 0 & 0 & 1 & 5 \end{bmatrix}$$
Each element of W represents the connectivity strength between vertex i and vertex j. For example, W[0][0] is the connection strength between vertex 0 and itself (we can assume that 5 is the maximum connectivity strength.) A zero value can represent that there is no connection. So far the numbers in the adjacency matrix seem somewhat arbitrary, because they are. In general, we should use a Gaussian kernel to fill in the connectivity strength.
$$W_{i,j} = \exp\left(\frac{-\|x_i - x_j\|^2}{2\sigma^2}\right)$$
Degree Matrix
The degree of a vertex represents its total connectivity.
$$d_i = \sum_{j=0}^{N} W_{i,j}$$
We can then construct a degree matrix as follows.
$$D_V = \begin{bmatrix} d_0 & 0 & 0 & 0 \\ 0 & d_1 & 0 & 0 \\ 0 & 0 & d_2 & 0 \\ 0 & 0 & 0 & d_3 \end{bmatrix}$$
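As a small sketch, here is how one might build the Gaussian-kernel adjacency matrix and the corresponding degree matrix in numpy; the coordinates and the kernel width sigma are made-up values for illustration:

```python
import numpy as np

# made-up 2D coordinates for a handful of vertices
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [4.0, 4.0]])
sigma = 1.0  # assumed kernel width

# pairwise squared distances, then the Gaussian kernel
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists / (2 * sigma**2))

# the degree of each vertex is the sum of its row in W
D = np.diag(W.sum(axis=1))
print(W.round(3))
print(D.round(3))
```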
Minimizing Relationship
Let's define A as a subset of V such that $A \subset V$. Also define $\bar{A}$ to be the set of vertices that are not in the set A. Then we can say that:
$$A \cup \bar{A} = V$$
Let's use a concrete example; this is a different example from above. Suppose we are given 6 vertices, and here are their coordinates.
import matplotlib.pyplot as plt
%matplotlib inline

Vx = [0, 0, 1, 5, 6, 6]
Vy = [0, 1, 0, 2, 2, 3]
plt.scatter(Vx, Vy)
plt.show()
Visually speaking, it is very clear to us that these points can be grouped into two clusters. We can say that everything in the bottom-left group belongs to the set A.
$$A = \{(0, 0), (0, 1), (1, 0)\}$$
And everything in the upper-right group belongs to the complement set $\bar{A}$.
$$\bar{A} = \{(5, 2), (6, 2), (6, 3)\}$$
We can designate a feature vector for these vertices, and this feature vector represents whether a vertex is in the set A or not using 1 to indicate positive and 0 to indicate negative.
$$f_A = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$
Now the natural question to ask is, how does this work if we have more than 2 clusters? Suppose we want 6 clusters; then we simply create 6 feature vectors, each representing one of the groups A, B, C, D, E, F.
Now we can say that the point of spectral clustering is to find feature vectors such that the relationship between points in a set and points not in that set is minimized. We can declare the following equation as our relationship.
$$R = \sum_i \sum_j W_{i,j}\,(f_i - f_j)^2$$
Our objective is to find the correct feature vector(s) such that two vertices that share a strong connection are categorized into the same cluster, which in turn minimizes the relationship expression. There exist many solutions to this minimization, as we shall see soon.
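To make this concrete, here is a sketch that evaluates R on the six points above for the "correct" feature vector versus an arbitrary wrong one (sigma = 1 is an assumed kernel width); the correct assignment yields a much smaller R:

```python
import numpy as np

# the six points from the scatter plot above
points = np.array([[0, 0], [0, 1], [1, 0], [5, 2], [6, 2], [6, 3]], dtype=float)
sigma = 1.0  # assumed kernel width

sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists / (2 * sigma**2))

def relationship(f, W):
    # R = sum_i sum_j W_ij * (f_i - f_j)^2
    return np.sum(W * (f[:, None] - f[None, :]) ** 2)

f_good = np.array([1, 1, 1, 0, 0, 0])  # bottom-left cluster vs upper-right cluster
f_bad = np.array([1, 0, 1, 0, 1, 0])   # an arbitrary wrong assignment
print(relationship(f_good, W), relationship(f_bad, W))
```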
Expanding the square and using the definition of the degree:

$$R = \sum_i \sum_j W_{i,j}\left(f_i^2 - 2 f_i f_j + f_j^2\right) = \sum_i d_i f_i^2 - 2\sum_i \sum_j W_{i,j} f_i f_j + \sum_j d_j f_j^2$$

Since W is symmetric and looping through i is the same as looping through j, the first term and the third term of the above equation are equivalent:

$$R = 2\left(\sum_i d_i f_i^2 - \sum_i \sum_j W_{i,j} f_i f_j\right) = 2\left(f^T D f - f^T W f\right) = 2\, f^T (D - W)\, f$$
We usually call this term, the degree matrix minus the adjacency matrix, the Laplacian: $L = D - W$. We can EASILY minimize the expression $f^T L f$ based on what we derived in Part 3, since the Laplacian is symmetric and positive semidefinite. In conclusion, if we want to find the solutions (vectors $f$) that minimize the connectivity and clustering relationship, we find the least dominant eigenvectors of the Laplacian matrix, i.e. those with the smallest eigenvalues. This was mindblowing to me when I first learned about it.
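Putting everything together, here is a minimal end-to-end sketch on the six points from earlier, assuming sigma = 1 as the kernel width; the clustering step here is just thresholding the sign of the second eigenvector:

```python
import numpy as np

# the six points from the scatter plot example
points = np.array([[0, 0], [0, 1], [1, 0], [5, 2], [6, 2], [6, 3]], dtype=float)
sigma = 1.0  # assumed kernel width

# adjacency matrix via the Gaussian kernel
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists / (2 * sigma**2))

# degree matrix and Laplacian
D = np.diag(W.sum(axis=1))
L = D - W

# eigh returns eigenvalues in ascending order for a symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(L)

# the smallest eigenvalue is (numerically) zero with a constant eigenvector,
# so the second eigenvector carries the two-cluster structure
fiedler = eigenvectors[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)  # the first three points end up in one cluster, the last three in the other
```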