Vanilla Recurrent Neural Network
A recurrent neural network is a type of network architecture that accepts variable-length inputs and produces variable-length outputs, in contrast to a vanilla feed-forward neural network. We can also consider inputs of variable length, such as video frames, where we want to make a decision at every frame of the video.
One-to-one
This is the classic feed-forward neural network architecture: one input and one expected output.
One-to-many
This can be thought of as image captioning. We have one image as a fixed size input and the output can be words or sentences which are variable in length.
Many-to-one
This is used for sentiment classification. The input is expected to be a sequence of words or even paragraphs of words. The output can be a regression output with continuous values which represent the likelihood of having a positive sentiment.
Many-to-many
This model is ideal for machine translation, like the one we see on Google Translate. The input could be an English sentence, which has variable length, and the output will be the same sentence in a different language, which also has variable length. The last many-to-many model can be used for video classification at the frame level: feed every frame of a video into the neural network and expect an output right away. However, since frames are generally dependent on each other, the network needs to propagate its hidden state from the previous frame to the next. Thus, we need a recurrent neural network for this kind of task.
Instead of imagining that the hidden state is recurrently fed back into the network, it's easier to visualize the process if we unroll the operation into a computational graph that is composed of many time steps. (The concept of hidden state and the mathematical formulation will be explained in the next section.)
For example, we begin with a zeroed vector as our hidden state on the left. We feed it into the network along with our first input. When we receive the next input, we take the new hidden state and feed it into the network again with the second input. The process goes on until the point where we wish to compute the final output of the network.
We use the same set of weights for every time step of the computation.
For the many-to-many case, we compute a y[t] and a loss for every time step. At the end we simply sum up the loss of all the time steps and count that as the total loss of the network.
When we think about backpropagation for this model, we will have a separate gradient for W flowing from each of those time steps, and the final gradient for W will be the sum of all those individual time-step gradients. Imagine that we have some sort of ground-truth label for every step of the sequence:
If we have this many-to-one situation, we make the decision based on the final hidden state of the network. This final hidden state summarizes all of the context from the entire sequence.
If we have this one-to-many situation, where we want to receive a fixed-size input and produce a variable-length output, then we would commonly use that fixed-size input to initialize the hidden state and then let the network propagate and evolve the hidden state forward.
For sequence-to-sequence models, where we might want to do something like machine translation, this is a combination of the many-to-one and one-to-many architectures. We proceed in two stages: (1) the encoder receives a variably sized input, like an English sentence, and encodes it into a hidden state vector; (2) the decoder receives that hidden state vector and produces a variably sized output. The motivation for using this architecture is modularity: we can easily swap out the encoder and decoder for different types of language translation.
We can process a sequence of vectors x by applying a recurrence formula at every time step. The input vector at time step t is represented by x[t], and the hidden state at time step t is represented by h[t]; thus, we can think of h[t - 1] as the previous hidden state. The new hidden state is produced simply by multiplying the input and the previous hidden state by some weights W and applying a non-linearity.
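The recurrence is commonly written in the following form; this is a standard formulation, consistent with the tanh example used later in this section:

```latex
% Vanilla RNN recurrence: the same function f_W, with the same weights,
% is applied at every time step.
\[
  h_t = f_W(h_{t-1}, x_t) = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right)
\]
```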
NOTE: The same function and same set of parameters are used at every time step.
Here's a simple one-to-many vanilla recurrent neural network example in functional form. If we were to produce h[t], we need some weight matrices, the previous hidden state h[t-1], the input x[t], and a non-linearity tanh. Since this is a one-to-many network, we'd want to produce an output y[t] at every time step; thus, we need another weight matrix that accepts a hidden state and projects it to an output.
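As a minimal NumPy sketch of this functional form (the weight names W_xh, W_hh, W_hy are placeholders matching the notation in this section):

```python
import numpy as np

def rnn_step(h_prev, x, W_xh, W_hh, W_hy):
    # Update the hidden state h[t] from h[t-1] and x[t], then project an output y[t].
    h = np.tanh(W_xh @ x + W_hh @ h_prev)  # h[t] = tanh(W_xh x[t] + W_hh h[t-1])
    y = W_hy @ h                           # y[t] = W_hy h[t]
    return h, y
```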
Using the softmax loss and the gradient of the softmax loss for every time step, we can derive grad_y. Now we are tasked with calculating the gradients with respect to the weight matrices, the biases, and the hidden states.
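For reference, here is a minimal sketch of how grad_y falls out of the softmax loss at a single time step (y is the raw score vector and target is the index of the correct class; both names are assumptions):

```python
import numpy as np

def softmax_loss_and_grad(y, target):
    # Softmax cross-entropy loss and its gradient with respect to the scores y.
    probs = np.exp(y - np.max(y))      # shift scores for numerical stability
    probs /= np.sum(probs)
    loss = -np.log(probs[target])
    grad_y = probs.copy()
    grad_y[target] -= 1.0              # dL/dy = probs - one_hot(target)
    return loss, grad_y
```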
Please look at the Character-level Language Model section below for a detailed backprop example.
For a recurrent neural network, we are essentially doing backpropagation through time, which means that we forward through the entire sequence to compute the losses, then backward through the entire sequence to compute the gradients.
However, this becomes problematic when we want to train on a sequence that is very long. For example, if we were to train on a paragraph of words, we would have to iterate through many layers before we could compute a single gradient step. In practice, people use an approximation called truncated backpropagation through time: run forward and backward through chunks of the sequence instead of the whole sequence.
Even though our input sequence can potentially be very long or even infinite, when we are training our model we will step forward for some number of steps and compute a loss only over this sub-sequence of the data. Then we backpropagate through this sub-sequence and make a gradient step on the weights. When we move to the next batch, we still have the hidden state from the previous batch of data, and we carry this hidden state forward. The forward pass is unaffected, but we will only backpropagate through this second batch.
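A rough sketch of truncated backpropagation through time under these assumptions; forward_chunk, backward_chunk, and sgd_update are hypothetical helpers standing in for the model's actual forward, backward, and update code:

```python
def train_truncated_bptt(data, params, h, chunk_len=25, lr=1e-1):
    # Train on fixed-length chunks while carrying the hidden state across chunks.
    for start in range(0, len(data) - 1, chunk_len):
        inputs  = data[start : start + chunk_len]
        targets = data[start + 1 : start + chunk_len + 1]

        # Forward over just this chunk; h is carried over from the previous chunk.
        loss, caches, h = forward_chunk(inputs, targets, params, h)

        # Backward only through this chunk; gradients do not flow into earlier chunks.
        grads = backward_chunk(caches, params)

        # Gradient step on the shared weights.
        params = sgd_update(params, grads, lr)
    return params, h
```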
Suppose that we have a character-level language model with the vocabulary ['h', 'e', 'l', 'o']. An example training sequence is hello. The same hidden-layer output is fed both to the output layer and to the next hidden state; as noted below, y[t] is a product of W_hy and h[t]. Since we know what output we are expecting, we can backpropagate the loss and update the weights.
The y[t] is a prediction of which letter is most likely to come next. For example, when we feed h into the network, e is the expected output of the network, because the only training example we have is hello.
At test time, we sample characters one at a time and feed them back into the model to produce a whole sequence of characters (which makes up a word). We seed the model with a prefix, like the letter h in this case. The output is a softmax vector which represents a probability distribution, so we can sample from it.
This means each character has some chance to be selected. Sampling gives us more diversity in the output. This is evident in sentence construction: given a prefix, we can have multiple words and phrases to represent the same idea.
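Here is a minimal sampling sketch, assuming trained weights W_xh, W_hh, W_hy and biases b_h, b_y (placeholder names matching the notation in this section) and a vocabulary of size vocab_size:

```python
import numpy as np

def sample(h, seed_idx, n, W_xh, W_hh, W_hy, b_h, b_y, vocab_size):
    # Generate n character indices, seeding the model with one character.
    x = np.zeros((vocab_size, 1))
    x[seed_idx] = 1                                       # one-hot seed character
    indices = []
    for _ in range(n):
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)            # hidden state update
        y = W_hy @ h + b_y                                 # unnormalized scores
        p = np.exp(y - np.max(y)) / np.sum(np.exp(y - np.max(y)))  # softmax
        idx = np.random.choice(vocab_size, p=p.ravel())   # sample, don't argmax
        x = np.zeros((vocab_size, 1))
        x[idx] = 1                                        # feed the sample back in
        indices.append(idx)
    return indices
```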
Let's use the same tanh example we had above to implement a single-layer recurrent neural network. The forward pass is quite easy. Assuming the input is a list of character indices, i.e. a => 0, b => 1, etc., the target is a list of character indices that represents the next letter in the sequence. For example, the target is the characters of the word ensorflow and the input is tensorflo. Given the letter t, the model should predict that the next letter is e.
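A minimal NumPy sketch of this forward pass, assuming one-hot encoded characters and placeholder parameter names W_xh, W_hh, W_hy, b_h, b_y:

```python
import numpy as np

def forward(inputs, targets, h_prev, W_xh, W_hh, W_hy, b_h, b_y, vocab_size):
    # Forward pass over a sequence of character indices; returns the summed loss
    # and the cached activations needed for backpropagation.
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    for t, idx in enumerate(inputs):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][idx] = 1                                          # one-hot input x[t]
        hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h)  # hidden state h[t]
        y = W_hy @ hs[t] + b_y                                  # scores y[t]
        ps[t] = np.exp(y - np.max(y)) / np.sum(np.exp(y - np.max(y)))  # softmax
        loss += -np.log(ps[t][targets[t], 0])                   # cross-entropy at step t
    return loss, xs, hs, ps
```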
Now here's the fun part: computing the gradients for backpropagation. First of all, let's remind ourselves what our model is.
First, compute the gradient of the loss with respect to the output vector y:
Then the gradient of the loss with respect to W_hy, h, and the bias:
We need to perform a little u-substitution here to simplify our derivatives. We find the gradient of the loss with respect to u and then use that to find the rest of the gradients.
Finally, we can compute the gradients for the last two parameters:
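Putting these steps together, here is a minimal NumPy sketch of the backward pass over the caches produced by the forward sketch above; du is the gradient through the tanh pre-activation, i.e. the u-substitution mentioned earlier:

```python
import numpy as np

def backward(xs, hs, ps, targets, W_xh, W_hh, W_hy):
    # Backpropagation through time over the cached forward pass.
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    db_h = np.zeros((W_hh.shape[0], 1))
    db_y = np.zeros((W_hy.shape[0], 1))
    dh_next = np.zeros((W_hh.shape[0], 1))

    for t in reversed(range(len(targets))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1                  # gradient of softmax loss w.r.t. y[t]
        dW_hy += dy @ hs[t].T                # dL/dW_hy
        db_y  += dy                          # dL/db_y
        dh = W_hy.T @ dy + dh_next           # gradient flowing into h[t]
        du = (1 - hs[t] ** 2) * dh           # u-substitution: back through tanh
        dW_xh += du @ xs[t].T                # dL/dW_xh
        dW_hh += du @ hs[t - 1].T            # dL/dW_hh
        db_h  += du                          # dL/db_h
        dh_next = W_hh.T @ du                # pass gradient to the previous time step
    return dW_xh, dW_hh, dW_hy, db_h, db_y
```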
We can construct a multi-layer recurrent neural network by stacking RNN layers: simply take the output hidden states of one layer and feed them into the next layer as its input sequence, and repeat. However, in general RNNs do not go very deep, due to the exploding gradient problem with long sequences of data. Also, for most natural language problems, there isn't a lot of incentive to go deep at every time step; the key challenge is the long sequence of data.
In this case, the W[l] is a (hidden_dim, 2 * hidden_dim) matrix.
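As a minimal sketch of one time step of layer l under this assumption (W_l is a placeholder for W[l] above; it multiplies the concatenation of the hidden state coming from the layer below and this layer's previous hidden state):

```python
import numpy as np

def stacked_rnn_step(h_below, h_prev, W_l):
    # One time step of layer l; W_l has shape (hidden_dim, 2 * hidden_dim).
    stacked = np.concatenate([h_below, h_prev], axis=0)  # shape (2 * hidden_dim, 1)
    return np.tanh(W_l @ stacked)                        # new hidden state, (hidden_dim, 1)
```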