In this section, we will recap some equations which were derived in detail in the post on back propagation in a fully connected layer [1]. These equations are essential to understanding back propagation in RNNs, including LSTMs. Consider a fully connected network with $\tanh$ as the non-linear activation, e.g.
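The exact notation in [1] may differ slightly; as a sketch, the forward pass of such a layer can be written as

$$Z = \tanh(W X + B)$$

where $X$ is the input column vector, $W$ the weight matrix, $B$ the bias vector and $Z$ the output.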
If $L$ is the loss of the network, and $\frac{\partial L}{\partial Z}$ is given (propagated back from the layer that follows this one in the forward pass), then
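A reconstruction of those results in the notation sketched above, with $\odot$ denoting element-wise multiplication and using $\tanh'(a) = 1 - \tanh^2(a)$:

$$\frac{\partial L}{\partial W} = \left[(1 - Z \odot Z) \odot \frac{\partial L}{\partial Z}\right] X^T$$

$$\frac{\partial L}{\partial B} = (1 - Z \odot Z) \odot \frac{\partial L}{\partial Z}$$

$$\frac{\partial L}{\partial X} = W^T \left[(1 - Z \odot Z) \odot \frac{\partial L}{\partial Z}\right]$$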
Now consider a many-to-one Recurrent Neural Network (RNN) constructed using a single vanilla RNN cell (not an LSTM cell). The computational graph of the network, unrolled in time, is shown below.
Let $H_0$ be the initial hidden vector. Let $H_1$, $H_2$ and $H_3$ be the hidden vectors and $X_1$, $X_2$ and $X_3$ be the input vectors at times $t=1$, $t=2$ and $t=3$ respectively. Let $Y$ be the output vector of the network at time $t=3$. Let $h$, $x$ and $y$ be the sizes of the column vectors $H$, $X$ and $Y$ respectively, and let $W$, $U$ and $V$ be the learnable matrices of sizes $h \times h$, $h \times x$ and $y \times h$ respectively. Let $W_1 = W_2 = W_3 = W$ and $U_1 = U_2 = U_3 = U$ be dummy variables introduced to make the back propagation equations easier to write. The equations governing the forward propagation are given by,
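A sketch of these forward equations, assuming the standard vanilla RNN cell update with the bias terms omitted for brevity:

$$H_1 = \tanh(W_1 H_0 + U_1 X_1)$$

$$H_2 = \tanh(W_2 H_1 + U_2 X_2)$$

$$H_3 = \tanh(W_3 H_2 + U_3 X_3)$$

$$Y = V H_3$$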
Let $L$ be the loss (error) of the network that we want to minimize. $L$ is a function of $Y$ and the ground truth. During backward propagation, given $\frac{\partial L}{\partial Y}$, we want to calculate $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial U}$ and $\frac{\partial L}{\partial V}$. Note that $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial U}$ and $\frac{\partial L}{\partial V}$ are matrices of the same shape as $Y$, $W$, $U$ and $V$ respectively, and the last three are used to update the matrices $W$, $U$ and $V$ respectively, e.g.
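For example, with plain gradient descent and a learning rate $\alpha$ (assumed here purely for illustration), the updates take the form

$$W \leftarrow W - \alpha \frac{\partial L}{\partial W}, \qquad U \leftarrow U - \alpha \frac{\partial L}{\partial U}, \qquad V \leftarrow V - \alpha \frac{\partial L}{\partial V}$$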
Now,
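Applying the fully connected layer results recapped above to the sketched forward equations (so under the same no-bias assumption), and summing the contributions of the dummy copies $W_1, W_2, W_3$ and $U_1, U_2, U_3$:

$$\frac{\partial L}{\partial V} = \frac{\partial L}{\partial Y} H_3^T$$

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{3} \frac{\partial L}{\partial W_t} = \sum_{t=1}^{3} \left[(1 - H_t \odot H_t) \odot \frac{\partial L}{\partial H_t}\right] H_{t-1}^T$$

$$\frac{\partial L}{\partial U} = \sum_{t=1}^{3} \frac{\partial L}{\partial U_t} = \sum_{t=1}^{3} \left[(1 - H_t \odot H_t) \odot \frac{\partial L}{\partial H_t}\right] X_t^T$$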
Therefore, we need to calculate $\frac{\partial L}{\partial H_1}$, $\frac{\partial L}{\partial H_2}$ and $\frac{\partial L}{\partial H_3}$. Now,
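These follow from the same recipe, propagating the gradient back through the output layer and then through each cell in turn (again under the sketched forward equations):

$$\frac{\partial L}{\partial H_3} = V^T \frac{\partial L}{\partial Y}$$

$$\frac{\partial L}{\partial H_2} = W^T \left[(1 - H_3 \odot H_3) \odot \frac{\partial L}{\partial H_3}\right]$$

$$\frac{\partial L}{\partial H_1} = W^T \left[(1 - H_2 \odot H_2) \odot \frac{\partial L}{\partial H_2}\right]$$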
Substituting the values of $\frac{\partial L}{\partial H_1}$, $\frac{\partial L}{\partial H_2}$ and $\frac{\partial L}{\partial H_3}$ gives us the values of $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial U}$ as a function of $W$, $V$, $H_0$, $H_1$, $H_2$, $H_3$, $X_1$, $X_2$, $X_3$ and $\frac{\partial L}{\partial Y}$.
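As a sanity check, here is a small NumPy sketch (not code from this post; the sizes, the squared-error loss and the no-bias forward equations are assumptions made for illustration) that implements the formulas above and compares them against numerical gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes (illustrative only): h = hidden size, x = input size, y = output size.
h, x, y = 4, 3, 2

# Learnable matrices, as defined in the text.
W = rng.standard_normal((h, h)) * 0.1
U = rng.standard_normal((h, x)) * 0.1
V = rng.standard_normal((y, h)) * 0.1

# Initial hidden state, inputs and a ground-truth target (all made up for the check).
H0 = rng.standard_normal((h, 1))
X1, X2, X3 = (rng.standard_normal((x, 1)) for _ in range(3))
T = rng.standard_normal((y, 1))

def forward(W, U, V):
    """Forward pass of the unrolled many-to-one RNN (no bias terms, as assumed above)."""
    H1 = np.tanh(W @ H0 + U @ X1)
    H2 = np.tanh(W @ H1 + U @ X2)
    H3 = np.tanh(W @ H2 + U @ X3)
    Y = V @ H3
    L = 0.5 * np.sum((Y - T) ** 2)   # assumed squared-error loss
    return H1, H2, H3, Y, L

# ----- analytic gradients from the equations in this section -----
H1, H2, H3, Y, L = forward(W, U, V)
dY = Y - T                           # dL/dY for the squared-error loss

dH3 = V.T @ dY
dH2 = W.T @ ((1 - H3 * H3) * dH3)
dH1 = W.T @ ((1 - H2 * H2) * dH2)

dV = dY @ H3.T
dW = ((1 - H1 * H1) * dH1) @ H0.T \
   + ((1 - H2 * H2) * dH2) @ H1.T \
   + ((1 - H3 * H3) * dH3) @ H2.T
dU = ((1 - H1 * H1) * dH1) @ X1.T \
   + ((1 - H2 * H2) * dH2) @ X2.T \
   + ((1 - H3 * H3) * dH3) @ X3.T

# ----- numerical gradient check via central differences -----
def numerical_grad(param, eps=1e-6):
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        Lp = forward(W, U, V)[-1]
        param[idx] = old - eps
        Lm = forward(W, U, V)[-1]
        param[idx] = old
        grad[idx] = (Lp - Lm) / (2 * eps)
    return grad

print("dW error:", np.max(np.abs(dW - numerical_grad(W))))
print("dU error:", np.max(np.abs(dU - numerical_grad(U))))
print("dV error:", np.max(np.abs(dV - numerical_grad(V))))
```

If the analytic gradients are consistent with the forward pass, the printed errors should be tiny (on the order of $10^{-8}$ or smaller).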
Maybe in the future.