We compute an output `y[t]` and the loss for every time step. At the end we simply sum up the losses of all the time steps and count that as the total loss of the network.

Each time step of the input is represented by `x[t]`, and each time step of the hidden state is represented by `h[t]`. Thus we can think of `h[t - 1]` as the previous hidden state. Producing a hidden state is simply a matter of multiplying the input and the previous hidden state by some weights `W`. To compute `h[t]`, we need some weight matrices, the previous hidden state `h[t - 1]`, the input `x[t]`, and a non-linearity `tanh`:

`h[t] = tanh(W_hh · h[t - 1] + W_xh · x[t])`

We also want to produce an output `y[t]` at every time step, so we need another weight matrix that accepts a hidden state and projects it to an output:

`y[t] = W_hy · h[t]`
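To make the shapes concrete, here is a minimal NumPy sketch of a single forward step. The dimensions and the names `W_xh`, `W_hh`, `W_hy`, and `b_h` are assumptions for illustration, not fixed by anything above; a bias term is included since the implementation section later mentions one.

```python
import numpy as np

hidden_dim, input_dim, output_dim = 64, 4, 4

# Hypothetical parameters for a single-layer vanilla RNN.
W_xh = np.random.randn(hidden_dim, input_dim) * 0.01   # input -> hidden
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.01  # hidden -> hidden
W_hy = np.random.randn(output_dim, hidden_dim) * 0.01  # hidden -> output
b_h = np.zeros((hidden_dim, 1))                        # hidden bias

def step_forward(x_t, h_prev):
    """One time step: compute the new hidden state and the output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # h[t] = tanh(W_hh·h[t-1] + W_xh·x[t] + b)
    y_t = W_hy @ h_t                                 # y[t] = W_hy·h[t]
    return h_t, y_t
```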
During backpropagation, we receive `grad_y`, the upstream gradient of the loss with respect to the output. Now we are tasked with calculating the gradients with respect to the output weight matrix, the hidden states, the hidden-to-hidden and input-to-hidden weight matrices, the bias, and the inputs. Please look at the Character-level Language Model below for a detailed backprop example.
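As a first step (reusing the hypothetical names from the sketch above), the two gradients that fall directly out of `y[t] = W_hy · h[t]` look like this:

```python
def output_backward(grad_y, h_t):
    """Gradients of the output projection y[t] = W_hy @ h[t]."""
    grad_W_hy = grad_y @ h_t.T   # gradient for the output weight matrix
    grad_h = W_hy.T @ grad_y     # gradient flowing back into the hidden state
    return grad_W_hy, grad_h
```

The remaining gradients flow through the `tanh` non-linearity, which the character-level example below walks through.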
Suppose we have a vocabulary of just four letters, `['h', 'e', 'l', 'o']`. An example training sequence is `hello`. The same output from the hidden layer is fed to both the output layer and the next hidden layer; note that `y[t]` is a product of `W_hy` and `h[t]`. Since we know what output we are expecting, we can backpropagate the cost and update the weights. `y[t]` is a prediction for which letter is most likely to come next. For example, when we feed `h` into the network, `e` is the expected output, because the only training example we have is `hello`.
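Concretely, the training pairs for `hello` can be built like this; the particular index assignment is an assumption, and any consistent mapping works:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

text = 'hello'
inputs = [char_to_idx[ch] for ch in text[:-1]]   # 'h','e','l','l' -> [0, 1, 2, 2]
targets = [char_to_idx[ch] for ch in text[1:]]   # 'e','l','l','o' -> [1, 2, 2, 3]

def one_hot(index, size=len(vocab)):
    """Column vector with a 1 at the given character index."""
    v = np.zeros((size, 1))
    v[index] = 1.0
    return v
```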
We will use the `tanh` example we had up there to implement a single-layer recurrent neural network. The forward pass is quite easy. Assume the input is a list of character indices, i.e. `a => 0`, `b => 1`, etc., and the target is a list of character indices representing the next letter in the sequence. For example, the target is the characters of the word `ensorflow` and the input is `tensorflo`. Given the letter `t`, the network should predict that the next letter is `e`. Besides the inputs and targets, the forward pass needs the weight matrices, an initial hidden state `h`, and the bias `b`.
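Here is a minimal sketch of that forward pass over a whole sequence, reusing the hypothetical parameters and the `one_hot` helper from the earlier snippets. Softmax cross-entropy is assumed as the per-step loss; the text does not name one.

```python
def forward(inputs, targets, h):
    """Run the network over a list of character indices, summing per-step losses."""
    loss, cache = 0.0, []
    for x_idx, t_idx in zip(inputs, targets):
        x = one_hot(x_idx)
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # h[t]
        y = W_hy @ h                             # unnormalized scores y[t]
        p = np.exp(y - np.max(y))                # softmax, shifted for stability
        p /= np.sum(p)
        loss += -np.log(p[t_idx, 0])             # cross-entropy against the true next char
        cache.append((x, h, p))
    return loss, h, cache

loss, h, cache = forward(inputs, targets, np.zeros((hidden_dim, 1)))
```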
For the backward pass, it helps to write the hidden-state update as `h[t] = tanh(u)`, where `u` is the pre-activation. We first compute the gradient with respect to `u` and then use that to find the rest of the gradients. For convenience, the hidden-to-hidden and input-to-hidden weights can be stacked into a single `W`, a `(hidden_dim, 2 * hidden_dim)` matrix that multiplies the concatenation of `h[t - 1]` and `x[t]` (assuming the input has the same dimension as the hidden state).
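Under that convention, one backward step might look like the following sketch; `hx` (the stacked `[h[t - 1]; x[t]]` vector) and the function name are illustrative assumptions.

```python
def step_backward(grad_h, hx, h_t, W):
    """Backward through h[t] = tanh(W @ hx + b), where hx = concat(h[t-1], x[t])."""
    grad_u = (1.0 - h_t ** 2) * grad_h   # through tanh: d/du tanh(u) = 1 - tanh(u)^2
    grad_W = grad_u @ hx.T               # same (hidden_dim, 2 * hidden_dim) shape as W
    grad_b = grad_u
    grad_hx = W.T @ grad_u
    hidden_dim = h_t.shape[0]
    grad_h_prev = grad_hx[:hidden_dim]   # flows to time step t - 1
    grad_x = grad_hx[hidden_dim:]        # flows to the input
    return grad_W, grad_b, grad_h_prev, grad_x
```

Note that in full backpropagation through time, `grad_h` at step `t` is the sum of the gradient coming from `y[t]` and the `grad_h_prev` returned by step `t + 1`.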