Tensor Differential Calculus, Part 3

Tensor Differential Calculus Example: Backpropagation

By Eric Wong

This third and last notebook on tensor differential calculus is a showcase example, demonstrating how tensor differential calculus can be used to derive the updates for backpropagation in an elegant manner, without resorting to element-wise partial derivatives.

Problem setup

Consider a neural network with $k$ dense hidden layers and a softmax output layer. We'll use the following notation:

  • $h_k$ is the output of hidden layer $k$, so $h_k = g_k(a_k)$. The input layer is just $h_0(x) = x$.
  • $a_k = W_k h_{k-1}$ is the pre-activation of layer $k$
  • $g_k$ is the activation function of layer $k$; we'll use the sigmoid function.
  • $f$ is the output of the final layer, so with softmax this is $f = \textrm{softmax loss}(a_{k+1}, y) = -\log \frac{\exp([a_{k+1}]_y)}{\sum_i \exp([a_{k+1}]_i)}$, where $y$ is the true class label. If $y$ is the one-hot encoding of the class label, then we can write it in matrix form as $\textrm{softmax loss}(\hat y, y) = -y^T\log \frac{\exp(\hat y)}{1^T\exp(\hat y)}$

This is the standard fully connected neural network with sigmoid activation and softmax loss.
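
To make the setup concrete, here is a minimal sketch of the forward pass under these definitions. The layer sizes, the `sigmoid` helper, and the list layout `Ws = [W_1, ..., W_{k+1}]` are illustrative assumptions, not anything fixed by the derivation.

```python
import numpy as np

def sigmoid(z):
    # elementwise sigmoid activation g_k
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, y_onehot, Ws):
    """Forward pass for k dense sigmoid layers followed by a softmax loss.

    Ws = [W_1, ..., W_k, W_{k+1}]; the last matrix produces the softmax
    pre-activation a_{k+1}. Returns the loss plus the h's and a's that the
    backward pass will need.
    """
    hs, pre = [x], []              # h_0 = x
    for W in Ws[:-1]:
        a = W @ hs[-1]             # a_i = W_i h_{i-1}
        pre.append(a)
        hs.append(sigmoid(a))      # h_i = g_i(a_i)
    a_out = Ws[-1] @ hs[-1]        # a_{k+1}
    # softmax loss: -y^T a_{k+1} + log(1^T exp(a_{k+1}))
    loss = -y_onehot @ a_out + np.log(np.sum(np.exp(a_out)))
    return loss, hs, pre, a_out
```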

Differential of the output layer

Starting with the top layer we have

$$f(x) = -y^T\log \frac{\exp(a_{k+1}(x))}{1^T\exp(a_{k+1}(x))} = -y^T a_{k+1}(x) + \log (1^T\exp(a_{k+1}(x)))$$

Using the chain rule, we get

$$\mathrm df = -y^T \mathrm da_{k+1} + \frac{\exp(a_{k+1}(x))^T\mathrm da_{k+1}}{1^T\exp(a_{k+1}(x))} = \left(-y^T + \frac{\exp(a_{k+1}(x))^T}{1^T\exp(a_{k+1}(x))}\right)\mathrm da_{k+1}$$

which gives the Jacobian of $f$ with respect to $a_{k+1}$, which we'll denote as $J_f(a_{k+1})$. Comparing this to a typical derivation (slide 29), we see two benefits:

  1. It is a simpler derivation
  2. The matrix form for deriving updates falls out naturally instead of having to piece it back together
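
As a quick sanity check, the Jacobian above, $J_f(a_{k+1}) = -y^T + \frac{\exp(a_{k+1})^T}{1^T\exp(a_{k+1})}$, can be compared against a finite-difference approximation. Here is a minimal sketch of such a check; the test size and random values are arbitrary.

```python
import numpy as np

def softmax_loss(a, y_onehot):
    # f = -y^T a + log(1^T exp(a))
    return -y_onehot @ a + np.log(np.sum(np.exp(a)))

def jacobian_f_a(a, y_onehot):
    # J_f(a_{k+1}) = -y^T + exp(a)^T / (1^T exp(a)), a row vector
    return -y_onehot + np.exp(a) / np.sum(np.exp(a))

rng = np.random.default_rng(0)
a = rng.normal(size=5)
y = np.eye(5)[2]                     # one-hot label for class 2

eps = 1e-6                           # central finite differences as a reference
numeric = np.array([
    (softmax_loss(a + eps * e, y) - softmax_loss(a - eps * e, y)) / (2 * eps)
    for e in np.eye(5)
])
print(np.max(np.abs(jacobian_f_a(a, y) - numeric)))   # should be ~1e-9 or smaller
```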

Differential of the hidden layer output

Starting with the definition of the pre-activation, using differentials we have

$$\mathrm da_k = W_k\mathrm dh_{k-1}$$

Plugging this into the above for the $(k+1)$th layer, we have

$$\mathrm df = J_f(a_{k+1})W_{k+1}\mathrm dh_k$$

which gives the Jacobian of $f$ with respect to $h_k$, which we'll denote as $J_f(h_k)$.

Differential of the hidden layer pre-activation

Starting with the definition of $h_k$, using the chain rule for differentials we have

$$\mathrm dh_k = g'(a_k) \odot \mathrm da_k = \textrm{diag}(g'(a_k))\mathrm da_k$$

where $\textrm{diag}(\cdot)$ places a vector on the diagonal of a matrix and $\odot$ denotes the elementwise product. Plugging this into the above, we have

$$\mathrm df = J_f(h_k)\textrm{diag}(g'(a_k))\mathrm da_k$$

which gives the Jacobian of $f$ with respect to $a_k$, which we'll denote as $J_f(a_k)$.

By alternating these two differential identities, we can compute the Jacobian of $f$ with respect to $a_i$ and $h_i$ at every layer $i$.
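
Here is a minimal sketch of this alternation in NumPy, assuming the list layout from the hypothetical forward-pass sketch above (`Ws = [W_1, ..., W_{k+1}]` and `pre = [a_1, ..., a_k]`):

```python
import numpy as np

def backward_jacobians(Ws, pre, a_out, y_onehot):
    """Alternate the two rules to get J_f(a_i) and J_f(h_i) for every layer.

    J_f(a_{k+1}) = -y^T + softmax(a_{k+1})^T     (output layer)
    J_f(h_i)     = J_f(a_{i+1}) W_{i+1}          (through the weights)
    J_f(a_i)     = J_f(h_i) diag(g'(a_i))        (through the activation)
    """
    J_a = -y_onehot + np.exp(a_out) / np.sum(np.exp(a_out))   # J_f(a_{k+1})
    J_as, J_hs = [J_a], []
    for W, a in zip(reversed(Ws), reversed(pre)):
        J_h = J_as[-1] @ W                 # J_f(h_i) = J_f(a_{i+1}) W_{i+1}
        s = 1.0 / (1.0 + np.exp(-a))       # sigmoid(a_i)
        J_a = J_h * (s * (1.0 - s))        # row vector times diag(sigmoid'(a_i))
        J_hs.append(J_h)
        J_as.append(J_a)
    # returned as [J_f(a_1), ..., J_f(a_{k+1})] and [J_f(h_1), ..., J_f(h_k)]
    return list(reversed(J_as)), list(reversed(J_hs))
```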

Differential of the Weights

Starting with the definition of $a_k$, now treating $W_k$ as the variable (with $h_{k-1}$ held fixed), we have

$$\mathrm da_k = \mathrm dW_k\, h_{k-1}$$

Plugging this into $\mathrm df$ we get

$$\mathrm df = J_f(a_k)\,\mathrm dW_k\, h_{k-1} = \left(J_f(a_k)^T h_{k-1}^T\right)\cdot_T\mathrm d W_k$$

which gives the Jacobian of $f$ with respect to $W_k$, which we'll denote as $J_f(W_k) = J_f(a_k)^T h_{k-1}^T$. Notice how the tensor differential form allowed us to compute the Jacobian with respect to a matrix! Since $f$ is a scalar, $J_f(a_k)^T = \nabla_{a_k} f$, if that makes the form more familiar.
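
In code the weight gradient is just this outer product. A minimal sketch, with arbitrary hypothetical shapes (layer $k$ mapping a size-4 $h_{k-1}$ to a size-3 $a_k$):

```python
import numpy as np

rng = np.random.default_rng(0)
J_f_a_k = rng.normal(size=3)       # J_f(a_k) from the backward pass (row vector)
h_prev  = rng.normal(size=4)       # h_{k-1} from the forward pass

# J_f(W_k) = J_f(a_k)^T h_{k-1}^T: an outer product with the same shape as W_k.
grad_W = np.outer(J_f_a_k, h_prev)
print(grad_W.shape)                # (3, 4), ready for W_k <- W_k - lr * grad_W
```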

Differential of the Biases

We did not include any bias terms in the above (they can be included by simply extending the input with a feature that is always 1). However, if you want to include them explicitly, it is not too difficult. Since this just changes the pre-activation to $a_k = W_kh_{k-1} + b_k$, we get the following differential with respect to $b_k$ (holding $W_k$ and $h_{k-1}$ fixed):

$$\mathrm da_k = \mathrm db_k$$

So $J_f(a_k) = J_f(b_k)$.
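
In code this is only a bookkeeping change: the bias gradient reuses the same row vector already computed for the pre-activation. A tiny sketch, with placeholder numbers:

```python
import numpy as np

J_f_a_k = np.array([0.2, -0.5, 0.3])   # hypothetical J_f(a_k) from the backward pass
grad_b = J_f_a_k                        # J_f(b_k) = J_f(a_k), so b_k's update
                                        # uses the pre-activation gradient directly
```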

And that's it!
