LSTM Recurrent Neural Network

Exploding Gradient in Vanilla RNN

Recall our vanilla RNN model:

$$h_{t} = \tanh \left( W_{hh} h_{t-1} + W_{xh} x_{t} \right)$$

$$h_{t} = \tanh \left( \begin{pmatrix} W_{hh} & W_{xh} \end{pmatrix} \begin{pmatrix} h_{t-1} \\ x_{t} \end{pmatrix} \right)$$

$$h_{t} = \tanh \left( W \begin{pmatrix} h_{t-1} \\ x_{t} \end{pmatrix} \right)$$

For every time step of a sequence, we backpropagate from h[t] to h[t-1]. The gradient first flows through the tanh gate and then through the matrix multiplication gate. As we know, whenever we backprop through a matrix multiplication gate, the upstream gradient is multiplied by the transpose of the W matrix. This happens at every time step throughout the sequence. What if the sequence is very long?
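Concretely, one backward step through the tanh and matrix multiplication gates looks like this (written with the element-wise product used later in this post):

$$\frac{\partial L}{\partial h_{t-1}} = W_{hh}^{T} \left( \left(1 - h_{t}^{2}\right) \odot \frac{\partial L}{\partial h_{t}} \right)$$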

Figure: RNN gradient flow

The final expression for the gradient on h[0] will involve many factors of this weight matrix. This leads either to an exploding gradient problem or to a vanishing gradient problem, depending on whether the largest singular value of the weight matrix is greater than or less than 1. There's a simple hack to address the exploding case, which is gradient clipping, e.g. with numpy.clip. However, if the problem is a vanishing gradient, clipping isn't going to help.
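Here is a minimal sketch of that clipping hack, assuming dW is a numpy array holding the gradient of the loss with respect to the weights (the function name and threshold are just for illustration):

```python
import numpy as np

def clip_gradient(dW, threshold=5.0):
    # Clamp every element of the gradient into [-threshold, threshold]
    # so a single backprop step cannot blow up the weights.
    return np.clip(dW, -threshold, threshold)
```

A common alternative is to rescale the whole gradient only when its norm exceeds a threshold, which preserves the gradient's direction.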

Introducing LSTM

The LSTM has a fancier recurrence relation than the vanilla RNN. It has two states: one is the usual hidden state h[t] we see in the vanilla RNN, and the other is called the cell state c[t]. The cell state is an internal vector that is not exposed to the outside world.

Let's define some terminology here:

  • f forget gate: whether to erase cell

  • i input gate: whether to write to cell

  • g gate gate: how much to write to cell

  • o output gate: how much to reveal cell

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_{t} \end{pmatrix}$$

Note: The σ symbol represents the sigmoid activation function.

The cell state is defined as follows:

$$\begin{aligned} c_{t} &= f \odot c_{t-1} + i \odot g \\ &= \sigma(W_{hf} h_{t-1} + W_{xf} x_{t}) \odot c_{t-1} + \sigma(W_{hi} h_{t-1} + W_{xi} x_{t}) \odot \tanh(W_{hg} h_{t-1} + W_{xg} x_{t}) \end{aligned}$$

And the hidden state is a function of the cell state:

$$\begin{aligned} h_{t} &= o \odot \tanh(c_{t}) \\ &= \sigma(W_{ho} h_{t-1} + W_{xo} x_{t}) \odot \tanh(c_{t}) \end{aligned}$$

We take the previous cell state and hidden state as the inputs to our LSTM cell. The previous hidden state is combined with the input vector and multiplied by the weight matrix to produce the ifog gates. The forget gate is multiplied element-wise with the previous cell state. The input gate and gate gate are also multiplied element-wise. The two results are summed element-wise to produce the new cell state. The cell state is then squashed by a tanh and multiplied element-wise by the output gate to produce the next hidden state.

Figure: LSTM cell

I omitted biases in the above equations. Also, in some of the literature, people tend to omit the g gate. In a non-matrix form, including biases, we can express the internal cell equations in the following way:

$$f_{t} = \sigma \left( W_{hf} h_{t-1} + W_{xf} x_{t} + b_{f} \right)$$
$$i_{t} = \sigma \left( W_{hi} h_{t-1} + W_{xi} x_{t} + b_{i} \right)$$
$$o_{t} = \sigma \left( W_{ho} h_{t-1} + W_{xo} x_{t} + b_{o} \right)$$

And compute the cell state and hidden state in the following way:

$$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tanh \left( W_{hg} h_{t-1} + W_{xg} x_{t} + b_{g} \right)$$
$$h_{t} = o_{t} \odot \tanh \left( c_{t} \right)$$

LSTM Gradient Flow

Backpropagating from c[t] to c[t-1] involves only an element-wise multiplication by the f gate; there is no matrix multiplication by W. The f gate is different at every time step and its values range between 0 and 1 due to the sigmoid, so we avoid the problem of multiplying by the same matrix over and over again.
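Written out over a whole sequence, and using the per-step derivative ∂c[t]/∂c[t-1] = f derived in the backprop section below, the gradient that reaches an early cell state is just a product of forget gates:

$$\frac{\partial c_{T}}{\partial c_{0}} = \prod_{t=1}^{T} \frac{\partial c_{t}}{\partial c_{t-1}} = \prod_{t=1}^{T} f_{t}$$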

Backpropagating from h[t] back to earlier time steps through the cell state therefore passes through only a single tanh nonlinearity (where h[t] is computed from c[t]), rather than a tanh at every single step.

Figure: cell state gradient flow

LSTM Forward Propagation

The forward propagation isn't all that different from the vanilla recurrent neural network; we just have more variables now. Suppose we take a mini-batch of data of shape (N, T, D). N is our batch size, T is the length of the sequence, and D is the dimension of our input.

For example, take the sentence "hello world I am Calvin". We can treat this sentence as one input sequence with length T=5 because it has five words. The dimension of the input depends on how we represent each word. Before we feed the sentence into an RNN, each word has to be converted to a word vector of dimension D.
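As a toy illustration (the vocabulary, the dimension D=4, and the variable names below are made up for this example), the conversion from words to a (N, T, D) tensor might look like this:

```python
import numpy as np

# Hypothetical toy vocabulary and randomly initialized embedding table of shape (V, D).
vocab = {"hello": 0, "world": 1, "i": 2, "am": 3, "calvin": 4}
W_embed = np.random.randn(len(vocab), 4)

sentence = "hello world I am Calvin".lower().split()
indices = np.array([[vocab[w] for w in sentence]])  # shape (1, 5): N=1, T=5
x = W_embed[indices]                                # shape (1, 5, 4): (N, T, D)
print(x.shape)  # (1, 5, 4)
```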

Figure: word vector representation

Forward Step & Sequence Example

When we pass a mini-batch of sequences to the LSTM layer, we run each word vector of each sequence through a series of time steps. In each time step, we perform the following forward propagation:
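Here is a minimal sketch of that step function in numpy. The function name, argument names, and gate ordering (i, f, o, g) follow the equations in this post; they are illustrative, not a specific library API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """Single LSTM time step.
    x: (N, D), prev_h, prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
    """
    N, H = prev_h.shape
    # Activation matrix of shape (N, 4H), split into the four gates.
    a = x.dot(Wx) + prev_h.dot(Wh) + b
    i = sigmoid(a[:, 0:H])          # input gate
    f = sigmoid(a[:, H:2 * H])      # forget gate
    o = sigmoid(a[:, 2 * H:3 * H])  # output gate
    g = np.tanh(a[:, 3 * H:4 * H])  # gate gate
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    cache = (x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c)
    return next_h, next_c, cache
```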

The cache is necessary for backpropagation later. The forward propagation of an LSTM layer can be thought of as breaking a sequence into time steps and feeding each time step to the step function above.
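A sketch of that sequence-level loop, reusing the hypothetical lstm_step_forward above (the zero-initialized cell state is an assumption of this sketch):

```python
def lstm_forward(x, h0, Wx, Wh, b):
    """Run an LSTM over a mini-batch of sequences.
    x: (N, T, D), h0: (N, H). Returns hidden states of shape (N, T, H).
    """
    N, T, D = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    caches = []
    prev_h, prev_c = h0, np.zeros((N, H))  # cell state starts at zero
    for t in range(T):
        prev_h, prev_c, cache = lstm_step_forward(x[:, t, :], prev_h, prev_c, Wx, Wh, b)
        h[:, t, :] = prev_h
        caches.append(cache)
    return h, caches
```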

LSTM Back Propagation

We are given the following upstream gradients:

$$\frac{\partial L}{\partial c_{t}} \;,\; \frac{\partial L}{\partial h_{t}}$$

The objective is to find the ifog gate gradients, the weight and bias gradients, and the input gradients:

$$\frac{\partial L}{\partial i} \;,\; \frac{\partial L}{\partial f} \;,\; \frac{\partial L}{\partial o} \;,\; \frac{\partial L}{\partial g} \;,\; \frac{\partial L}{\partial W_{x}} \;,\; \frac{\partial L}{\partial W_{h}} \;,\; \frac{\partial L}{\partial b} \;,\; \frac{\partial L}{\partial x} \;,\; \frac{\partial L}{\partial h_{0}}$$

Recall that the ifog gates are defined as follows:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( \vec{x} W_{x} + \vec{h}_{t-1} W_{h} + b \right)$$

The expected output of the above calculation is a matrix of shape (N, 4H), where N is the mini-batch size and H is the hidden dimension. Using what we have above, we can compute the next cell state as follows:

$$c_{t} = f \odot c_{t-1} + i \odot g$$

With cell state, we can compute the next hidden state:

$$h_{t} = o \odot \tanh\left(c_{t}\right)$$

Compute Gradients

Since we are given the upstream gradients of the loss with respect to the output cell state and the output hidden state, we will first compute the gradient of the hidden state with respect to the cell state and the gradient of the new cell state with respect to the old cell state.

$$\frac{\partial h_{t}}{\partial c_{t}} = o \odot \left(1 - \tanh^{2}(c_{t})\right)$$
$$\frac{\partial c_{t}}{\partial c_{t-1}} = f$$

Now we can calculate gradient of loss with respect to previous cell state, which is a sum of contributions from both upstream gradients.

$$\frac{\partial L}{\partial c_{t-1}} = \frac{\partial L}{\partial c_{t}} \frac{\partial c_{t}}{\partial c_{t-1}} + \frac{\partial L}{\partial h_{t}} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial c_{t-1}}$$

Proceed to compute gradients for the ifog gates:

$$\frac{\partial L}{\partial i} = \frac{\partial L}{\partial c_{t}} \frac{\partial c_{t}}{\partial i} + \frac{\partial L}{\partial h_{t}} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial i}$$
$$\frac{\partial L}{\partial f} = \frac{\partial L}{\partial c_{t}} \frac{\partial c_{t}}{\partial f} + \frac{\partial L}{\partial h_{t}} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial f}$$
$$\frac{\partial L}{\partial o} = \frac{\partial L}{\partial h_{t}} \frac{\partial h_{t}}{\partial o}$$
$$\frac{\partial L}{\partial g} = \frac{\partial L}{\partial c_{t}} \frac{\partial c_{t}}{\partial g} + \frac{\partial L}{\partial h_{t}} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial g}$$

Continue to backpropagate and we are almost done! Backpropagate through the nonlinearity of each of the ifog gates and combine the four gate gradients into one matrix of shape (N, 4H). This is the gradient of our activation matrix A, the pre-nonlinearity values x W_x + h_{t-1} W_h + b. Then the following gradients can be obtained:

$$\frac{\partial L}{\partial W_{x}} = \frac{\partial L}{\partial A} \frac{\partial A}{\partial W_{x}}$$
$$\frac{\partial L}{\partial W_{h}} = \frac{\partial L}{\partial A} \frac{\partial A}{\partial W_{h}}$$
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial A} \frac{\partial A}{\partial b}$$
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial A} \frac{\partial A}{\partial x}$$
$$\frac{\partial L}{\partial h_{0}} = \frac{\partial L}{\partial A} \frac{\partial A}{\partial h_{0}}$$
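Putting those pieces together, here is a minimal sketch of the backward step in numpy, matching the hypothetical lstm_step_forward above (variable names are illustrative):

```python
def lstm_step_backward(dnext_h, dnext_c, cache):
    """dnext_h, dnext_c: (N, H) upstream gradients on h[t] and c[t]."""
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c = cache
    tanh_c = np.tanh(next_c)

    # Total gradient on the cell state: direct upstream plus the path through h[t].
    dc = dnext_c + dnext_h * o * (1 - tanh_c ** 2)

    # Gate gradients, backpropagated through the sigmoid/tanh nonlinearities.
    di = dc * g * i * (1 - i)
    df = dc * prev_c * f * (1 - f)
    do = dnext_h * tanh_c * o * (1 - o)
    dg = dc * i * (1 - g ** 2)

    # Gradient of the (N, 4H) activation matrix A, then of the inputs and weights.
    dA = np.hstack((di, df, do, dg))
    dx = dA.dot(Wx.T)
    dprev_h = dA.dot(Wh.T)
    dprev_c = dc * f
    dWx = x.T.dot(dA)
    dWh = prev_h.T.dot(dA)
    db = dA.sum(axis=0)
    return dx, dprev_h, dprev_c, dWx, dWh, db
```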

Backprop Step & Sequence Example

Using the cache from the forward propagation time step, we can compute gradients for each time step.

We use the step function inside the sequence function: we iterate through the hidden states over time in reversed order, hence the name backpropagation. Every backprop time step gives us a gradient value for each of our weights and biases, and we accumulate them over the whole sequence.
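A sketch of that reversed loop, again reusing the hypothetical step functions above:

```python
def lstm_backward(dh, caches):
    """dh: (N, T, H) upstream gradients on every hidden state output."""
    N, T, H = dh.shape
    D = caches[0][0].shape[1]
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, 4 * H))
    dWh = np.zeros((H, 4 * H))
    db = np.zeros(4 * H)
    dprev_h = np.zeros((N, H))
    dprev_c = np.zeros((N, H))
    for t in reversed(range(T)):
        # Gradient on h[t] comes from the loss at step t plus from later time steps.
        dnext_h = dh[:, t, :] + dprev_h
        dx[:, t, :], dprev_h, dprev_c, dWx_t, dWh_t, db_t = lstm_step_backward(
            dnext_h, dprev_c, caches[t])
        dWx += dWx_t   # accumulate weight gradients over the whole sequence
        dWh += dWh_t
        db += db_t
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
```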

Example Model

Here is a simple model that takes a matrix of word indices as input and produces a matrix of word indices. I made the model such that it can answer simple questions, but don't expect it to be as smart as Alexa, because it has no natural language model. All it does is take an input sequence and produce an output sequence.

The architecture is very simple; it has three layers:

  1. Word Embedding Layer - It converts word indices into vector representations.

  2. LSTM Layer - It takes the word vectors and produces another set of word vectors.

  3. Affine Layer - It takes the LSTM outputs and converts them into softmax scores.

The softmax scores are used to classify each output vector into the appropriate word index.
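As a rough sketch of how the three layers fit together (the parameter names and the (H, V) affine shape below are assumptions of this sketch, not the exact model), reusing the hypothetical lstm_forward above:

```python
def model_forward(word_indices, W_embed, Wx, Wh, b, W_affine, b_affine, h0):
    """word_indices: (N, T) integers; W_embed: (V, D); W_affine: (H, V)."""
    x = W_embed[word_indices]                   # word embedding layer -> (N, T, D)
    h, caches = lstm_forward(x, h0, Wx, Wh, b)  # LSTM layer -> (N, T, H)
    scores = h.dot(W_affine) + b_affine         # affine layer -> (N, T, V) softmax scores
    return scores
```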

Now that we have our model defined, we run a solver with the Adam optimizer to train it.


Since I have so little data, I am still surprised that it can actually answer some basic questions. However, it is clearly not answering the questions in an intelligent manner. When I asked "where is the sky?", it tried to tell me what color it is. That is because in my training data I have only one data point that has anything to do with the sky.
