Word2Vec
The idea of word2vec is actually quite simple. We want to train a 2-layer neural network to perform a fake task. The weights in the hidden layer will become our embedding vectors for our vocabulary in the corpus.
Fake Task
Given a specific word in the middle of a sentence, find the probability for every word in our vocabulary of being the nearby word.
For example, I have a sentence.
Live as if you were to die tomorrow and learn as if you were to live forever.
I pick tomorrow to be my center word. Now the neural network is supposed to tell me what is the probability of die being the nearby word of tomorrow. If it is properly trained, it should be close to one. Similarly if I ask the neural network, what is the probability of live being the nearby word of tomorrow, it should be close to zero.
Architecture
Input
The input is a word that gets translated into one-hot encoding. The encoding vector should have length of V
where V
represents the total vocabulary length. For example, if I have 10 unique words in my vocabulary, one of the words will be encoded as the following.
Output
Each output is a vector of same length V
; it contains the probability for all the words that they are the nearby word of the input center word. The output vector will contain float ranging from 0 to 1.
Lookup Table
Our neural network has two layers and two sets of weights. Suppose we have 1000 words in our vocabulary and 300 is our feature dimension. The first set of weights from the hidden layer will be our word vector lookup table after we finish training.
Implementation
Now we are ready to actually create the model for performing such task. Let's define our word vector feature length to be D=100
.
Training
Standard Training
The architecture is pretty straightforward, this is just a standard softmax gradient backpropgation. Suppose we are given the center word and a correct target context word, e.g. (Porsche, Boxster)
.
Once we have grad_w1
and grad_w2
, we can perform updates on these two matrices. Eventually grad_w1
will be our list of embedding vectors for all words.
Problem
However, if we have a giant corpus with one million unique vocabulary words, and our feature dimension is defined to be 100. We will have a matrix that has 100 million entries. This is not going to fit in memory. We will also have serious trouble with computing the matrix multiplication. We need a better technique than this.
Negative Sampling
In the original paper Distributed Representations of Words and Phrases and their Compositionality, the authors modified the optimization objective with a technique they called negative sampling, which causes each training sample to update only a small percentage of the model's weights instead of the whole weights.
Essentially we will select 1 positive example, e.g. (Porsche, Boxster)
and 5 random negative examples, e.g. (Porsche, since)
, (Porsche, styling)
, (Porsche, concept)
, (Porsche, engine)
and (Porsche, model)
, assuming that the window size is 2.
Objective Function
The new objective function is written as follows.
Here's an example
And update the weight matrix w1
and w2
with the gradients returned from above calculations. This way we only need to modify 6 rows at a time.
Selecting Negative Samples
We will use unigram distribution to select more frequent words to be our negative samples. The probability of selecting a word to be a negative sample can be summarized by the following equation.
Last updated