Differentials 101 Extended to Tensors¶
By Eric Wong
Introduction¶
This notebook serves as a basic introduction to differential calculus, with a tensor twist. While there are plenty of resources on differential calculus (e.g., Matrix Differential Calculus by Magnus and Neudecker), the way they address tensor derivatives feels like a cop-out: the standard solution is to vectorize the inputs and outputs and use normal matrix differential calculus. This makes derivations involving higher-order tensor derivatives take convoluted forms: in order to use matrix differential calculus, a number of unnecessary additional vec and Kronecker product operations are often required to finagle the problem into the proper form.
This notebook attempts to do away with this vectorized reshaping, which requires more effort and is harder to understand. Instead, we generalize the concept of matrix differential calculus to tensor differential calculus, in order to maintain proper, intuitive shapes.
Differential Calculus¶
Since differential calculus is not typically covered in calculus courses, we give a brief overview here, starting from the familiar 1D case. The notation here closely follows that found in Matrix Differential Calculus.
Scalar Differential Calculus¶
Consider a first order Taylor expansion of $f$ around $a$:
$$f(x+a) = f(a) + f'(a)x + R_a(x)$$where $R_a(x)$ is the usual remainder term, and $f'(a)$ is the first derivative of $f$ evaluated at $a$.
Then, we will define the differential of $f$ around $a$ with stepsize $x$ to be $\mathrm df(a; x) = f'(a)x$. For simplicity, we will adopt the same notation as in 1D calculus and drop the $a$ term (e.g. writing $f'$ instead of $f'(a)$), so we have
$$\mathrm df(x) = f'\cdot x$$Note that the scalar $x$ has no restrictions on its actual value. In particular, this means that $x$ could be any other differential, e.g. $\mathrm dy$ or $\mathrm dx$. We will often omit it from the left hand side and define it implicitly by using $\mathrm dx$ on the right, so we write
$$\mathrm df = f'\cdot \mathrm dx$$So a differential is nothing more than the derivative times an independent scalar value $\mathrm dx$! Thanks to the calculus rules on derivatives and scalars, all the normal calculus rules also apply here, e.g. the sum rule is
- $\mathrm d(f+g) = (f+g)'\cdot x = f'\cdot x + g'\cdot x = \mathrm df + \mathrm dg$
So assuming you already know calculus, the rules when working with differentials are exactly the same.
Lastly,
$$\mathrm df = f'\cdot \mathrm dx \Rightarrow \frac{\mathrm df}{\mathrm dx} = f'$$Look familiar? This is the same form as how we typically write derivatives, but with differentials instead. In fact, this definition goes both ways: If you can derive an equation of the form
$$\mathrm df = a\cdot \mathrm dx$$then $a=\frac{df}{dx} = f'(x)$. This is how derivatives can be derived using differential calculus instead of standard calculus. An example follows:
Example: 1D differential¶
Consider $f = e^{2x + 1}$. The usual manner of calculating the derivative would be to use the chain rule and compute
$$\frac{df}{dx} = e^{2x+1}\cdot \frac{d}{dx}(2x+1) = 2e^{2x+1}$$Using the same rules but with differentials would result in
$$\mathrm df = e^{2x+1}\cdot \mathrm d(2x+1) = 2e^{2x+1}\mathrm dx$$so $f'(x) = 2e^{2x+1}$. So far there's no real difference in using 1D calculus as opposed to differentials, at least in the scalar case.
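As a quick sanity check (not part of the original derivation), we can compare this derivative against a central finite difference in NumPy; the test point and step size below are arbitrary choices:

```python
import numpy as np

def f(x):
    return np.exp(2 * x + 1)

def df_dx(x):
    # derivative obtained from the differential: df = 2 e^{2x+1} dx
    return 2 * np.exp(2 * x + 1)

x, h = 0.3, 1e-6
finite_diff = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(finite_diff, df_dx(x))                    # the two values should agree to several digits
assert np.isclose(finite_diff, df_dx(x))
```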
Differential Calculus rules¶
The basic rules for differential calculus are essentially the same as those for scalar calculus. The Matrix Cookbook has a number of them at the start of its derivatives section (see Eq. 33-45), but a few of the more basic ones are listed here; a quick numerical check of the product rule follows the list. If $A$ is a constant and $X,Y$ are functions, then
- $\mathrm d(AX) = A\mathrm d(X)$
- $\mathrm d(X+Y) = \mathrm dX + \mathrm dY$
- $\mathrm d(XY) = \mathrm dX\cdot Y + X\cdot \mathrm dY$
- $\mathrm d(X^T) = (\mathrm dX)^T$
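Here is the numerical check promised above: a minimal NumPy sketch verifying the product rule $\mathrm d(XY) = \mathrm dX\cdot Y + X\cdot \mathrm dY$ to first order, with arbitrarily chosen shapes and perturbation size:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
dX, dY = 1e-6 * rng.standard_normal((3, 4)), 1e-6 * rng.standard_normal((4, 2))

# d(XY) = dX Y + X dY should match the actual change in XY to first order
change = (X + dX) @ (Y + dY) - X @ Y
differential = dX @ Y + X @ dY
print(np.max(np.abs(change - differential)))   # only the dropped dX @ dY term remains (~1e-12)
```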
Lastly we have the all-important chain rule. If $h = f\circ g$ then:
- $\mathrm dh = \mathrm df(g; \mathrm dg)$
This might look a little strange, but it is very natural in practice. Recall that the second argument is simply a right multiply. Reading this from left to right, it is equivalent to 'taking the derivative of $f$ evaluated at $g(x)$, multiplied by the differential of $g(x)$'. This is exactly the chain rule of the scalar case, which is simply
$$\mathrm dh = f'(g)\mathrm dg$$
Chain rule example¶
Let's use the 1D least squares as an example: $f(x) = (\alpha x + b - y)^2$. Then:
$$\mathrm df = \mathrm d((\alpha x + b - y)^2) = 2(\alpha x + b - y)\mathrm d(\alpha x + b - y) = 2(\alpha x + b - y)\alpha \mathrm dx$$
Vector Differential Calculus¶
Next we consider functions that take in vector arguments and output vector values. Consider $f$ that outputs $n$ different values $f_1,\dots, f_n$ and takes in $m$ different variables $x_1,\dots, x_m$.
$$f(x) = (f_1(x), \dots, f_n(x)) = (f_1(x_1,\dots, x_m), \dots, f_n(x_1,\dots, x_m))$$The derivatives here are partial derivatives, which are typically arranged in a matrix known as the Jacobian $J$, as follows:
$$[J]_{ij} = \frac{\partial f_i}{\partial x_j}$$So each row corresponds to a scalar output $f_i$, and each column corresponds to a scalar input $x_j$.
The corresponding vector Taylor expansion is simply
$$f(x+a) = f(a) + J(a)\cdot x + R_a(x)$$where we now use the Jacobian instead of the derivative in the scalar case. Naturally, the corresponding differential is
$$\mathrm df = J\cdot x$$Since $x$ can be arbitrarily valued, we can replace it with the differential $\mathrm dx$ and get
$$\mathrm df = J\cdot \mathrm dx$$We get the same rules for vector differentials as we do for vector calculus. This includes the chain rule $\mathrm dh = \mathrm df(g; \mathrm dg)$. Recall that the chain rule in the scalar case simplified to $\mathrm dh = f'(g)\mathrm dg$. The corresponding vector case (letting $J_f$ be the Jacobian of $f$) is $$\mathrm dh = J_f(g)\mathrm dg$$
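To make the vector chain rule concrete, here is a small finite-difference check with made-up pieces: $g(x) = Ax + b$ and $f$ applied elementwise as $\sin$, so that $J_f(g(x)) = \mathrm{diag}(\cos(g(x)))$ and the chain rule predicts $J_h(x) = \mathrm{diag}(\cos(Ax+b))\,A$:

```python
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
x = rng.standard_normal(4)

g = lambda x: A @ x + b
h = lambda x: np.sin(g(x))                 # h = f o g with f = sin applied elementwise

# chain rule: J_h(x) = J_f(g(x)) J_g(x) = diag(cos(g(x))) A
J_chain = np.diag(np.cos(g(x))) @ A

# finite-difference Jacobian of h, column by column
eps = 1e-6
J_fd = np.column_stack([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
print(np.max(np.abs(J_chain - J_fd)))      # close to zero (finite-difference error only)
```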
A common use case of the above scenario is dealing with sums of an element-wise function $f$ (e.g. summing a loss over datapoints), which can be represented as a dot product with the all-ones vector $1$. Then:
$$\mathrm d(1^Tf(x)) = 1^T\mathrm d(f(x)) = 1^T(f'(x)\odot \mathrm dx) = f'(x)^T\mathrm dx$$
Example: scalar output, vector valued parameters¶
Consider an exponential function of a linear model, $f(x) = \exp(-\theta^T x)$ (the key ingredient of the logistic function). Then the usual gradient calculation would be
$$\nabla f(x) = \exp(-\theta^T x)\cdot \nabla(-\theta^T x) = -\exp(-\theta^T x)\cdot\theta$$The corresponding differential calculation would be
$$\mathrm df(x) = \exp(-\theta^T x)\cdot \mathrm d(-\theta^T x) = -\exp(-\theta^T x)\cdot\theta^T \mathrm dx$$Behind the scenes, we can start to see the simplification that differentials give us. When we compute the gradient of $\theta^Tx$, we see the following:
- We look through the possible rules from The Matrix Cookbook and find one for $\frac{\partial a^Tx}{\partial x} = a$. Note that adjacent to this rule, there is also a separate rule for $\frac{\partial x^Ta}{\partial x} = a$.
On the other hand the differential calculation is fairly straightforward:
- We apply the constant-multiple rule from above ($\mathrm d(AX) = A\,\mathrm dX$ for constant $A$), here with $A = -\theta^T$.
The problem with normal vector calculus is that we apply rules similar to those of scalar calculus, but not consistently. In order to cover all the possible matrix-vector combinations, a large portion of the derivatives section of The Matrix Cookbook consists of many similar-looking rules, each with a small tweak: for example, there is a rule for $\frac{\partial a^TXb}{\partial X}$ and a separate rule for $\frac{\partial a^TX^Tb}{\partial X}$.
Example: vector output, vector valued parameters¶
Let's consider a very simple example with vector output and vector valued parameters. Let $f(x) = (x^T\theta)\cdot v$ for some constant vectors $\theta, v$. One approach is to just consider the partial derivative of the $i$th output with respect to the $j$th input:
$$\frac{\partial f_i}{\partial x_j} = \frac{\partial}{\partial x_j}(x^T\theta \cdot v_i) = \theta_j v_i$$So the entire Jacobian of partial derivatives can be formed with
$$J = v \cdot \theta^T$$With differentials we get the usual story:
$$\mathrm df = \mathrm d((x^T\theta)\cdot v) = (\mathrm dx^T\theta) \cdot v = v \cdot \theta^T \mathrm dx$$which produces the same Jacobian.
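As a sanity check, the claimed Jacobian $J = v\cdot\theta^T$ can be compared against a finite-difference Jacobian; the dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, v = rng.standard_normal(4), rng.standard_normal(3)
x = rng.standard_normal(4)

f = lambda x: (x @ theta) * v              # f(x) = (x^T theta) v, output in R^3

J_analytic = np.outer(v, theta)            # claimed Jacobian J = v theta^T

eps = 1e-6
J_fd = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
print(np.max(np.abs(J_analytic - J_fd)))   # essentially zero, since f is linear in x
```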
Shapes of differentials¶
One major advantage of differentials, in terms of interpretability, is that a function and its differential always have exactly the same shape. For example, the product rule $\mathrm d(xy) = \mathrm dx\cdot y + x\cdot \mathrm dy$ structurally makes sense since $\mathrm dx$ and $x$ have the same shape, and $\mathrm dy$ and $y$ have the same shape. This is true in general: for an arbitrarily shaped function $f$, the shapes of $f$ and $\mathrm df$ are equivalent.
In contrast, partial derivatives do not have this property at all past 1 dimension. For example, the gradient of a scalar function is a vector, and the Jacobian of a vector valued function is a matrix. This causes problems when mixing different shapes and attempting to apply standard calculus rules, which is commonly tackled by taking an elementwise partial derivative and reconstructing the Jacobian from its individual entries.
Example: Product Rule for Scalar and Vector Valued Functions.¶
Suppose we have a scalar valued function $f : \mathbb{R}^k\rightarrow \mathbb{R}$ and a vector valued function $g : \mathbb{R}^k \rightarrow \mathbb{R}^n$. Consider the product $h(x) = (f\cdot g)(x) = f(x)g(x)$. What is the product rule here?
The standard product rule for matrix calculus is not immediately clear. Note that $f(x)g(x) = g(x)f(x)$ and so depending on how we (possibly incorrectly) apply a gradient product rule we could end up with a vector outer product ($g(x)\nabla f(x)^T$), a vector inner product ($g(x)^T\nabla f(x)$), or a vector element-wise product ($g(x) \odot \nabla f(x)$). So we proceed element-wise:
$$ \frac{\partial (f\cdot g)_i}{\partial x_j} = \frac{\partial}{\partial x_j} \left(f(x)g_i(x)\right) = (\nabla f(x))_jg_i(x) + f(x)(\nabla g_i(x))_j$$Piecing this together, we get that the Jacobian is
$$J_h = g(x)\cdot \nabla f(x)^T + f(x)\cdot J_g(x)$$And now the differential approach, using first the product rule (notice how the unchanging shape of the differential allows us to apply a straightforward product rule):
$$\mathrm d(f\cdot g) = \mathrm d(f(x)) \cdot g(x) + f(x) \cdot \mathrm d(g(x))$$Followed by the chain rule:
$$\mathrm d(f\cdot g) = (\nabla f(x)^T \mathrm dx)\cdot g(x) + f(x)\cdot J_g(x)\mathrm dx = \left(g(x) \cdot \nabla f(x)^T + f(x)\cdot J_g(x)\right)\mathrm dx$$
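A quick numerical check of the product rule result $J_h = g(x)\cdot\nabla f(x)^T + f(x)\cdot J_g(x)$, using one made-up choice of $f$ and $g$ (namely $f(x) = x^Tx$ and $g(x) = Ax$, so $\nabla f(x) = 2x$ and $J_g(x) = A$):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

f = lambda x: x @ x                        # scalar valued, gradient 2x
g = lambda x: A @ x                        # vector valued, Jacobian A
h = lambda x: f(x) * g(x)                  # the product h(x) = f(x) g(x)

# product rule: J_h = g(x) grad_f(x)^T + f(x) J_g(x)
J_analytic = np.outer(g(x), 2 * x) + f(x) * A

eps = 1e-6
J_fd = np.column_stack([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
print(np.max(np.abs(J_analytic - J_fd)))   # small finite-difference error
```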
'Matrix' Differential Calculus¶
When we have matrix valued functions or matrix arguments, there are several approaches:
- Proceed with element-wise operations and reconstruct the full derivative
- Vectorize the input and output and calculate the normal 2D Jacobian
- Operate on rows and columns to get 2D subcases
The book Matrix Differential Calculus takes the second option and vectorizes matrices to work in the vector case. Many explanations of neural network backpropagation take the first approach. Let's consider the softmax loss as an example for the third option. Let $\theta$ be an $m\times k$ matrix of parameters for $k$ classes, and let $x$ be an example with $m$ features. Let $y$ be the one-hot encoding of the true class label.
The probability of each class can be expressed as
$$\exp(\theta^Tx)/(1^T\exp(\theta^Tx))$$Then our typical loss can be expressed as the negative log probability of the correct class
$$l(\theta) = -y^T\log(\exp(\theta^Tx)/(1^T\exp(\theta^Tx))) = -y^T\theta^Tx + \log(1^T\exp(\theta^Tx))$$and the typical scenario is to calculate the gradient with respect to $\theta$ in order to perform gradient descent. A common solution is to compute the gradient with respect to $\theta_i$, the $i$th column of the $\theta$ matrix.
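Before differentiating, it may help to see the loss in code. The following is a minimal NumPy sketch of $l(\theta)$; the dimensions, random parameters, and the name `softmax_loss` are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 5, 3
theta = rng.standard_normal((m, k))        # m x k parameter matrix
x = rng.standard_normal(m)                 # one example with m features
y = np.eye(k)[1]                           # one-hot encoding, true class = 1

def softmax_loss(theta):
    logits = theta.T @ x                   # theta^T x, shape (k,)
    # l(theta) = -y^T theta^T x + log(1^T exp(theta^T x))
    return -y @ logits + np.log(np.sum(np.exp(logits)))

print(softmax_loss(theta))
```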
Standard Approach¶
Starting with the first step, using a few rules from the Matrix Cookbook we can arrive at
$$\nabla_{\theta_i}l(\theta) = -1(y_i = 1)x + \frac{1}{1^T\exp(\theta^Tx)}\nabla_{\theta_i}(1^T\exp(\theta^Tx))$$Further application of the chain rule results in
$$\nabla_{\theta_i}l(\theta) = \left(-1(y_i = 1) + \frac{\exp(\theta_i^Tx)}{1^T\exp(\theta^Tx)}\right) x$$Piecing these vector gradients together produces the full matrix gradient.
$$\frac{\partial l(\theta)}{\partial\theta} = x\left(-y + \frac{\exp(\theta^Tx)}{1^T\exp(\theta^Tx)}\right)^T$$
Matrix Differential Calculus Redone¶
Consider a matrix valued function $f: \mathbb{R}^{N_1\times N_2} \rightarrow \mathbb{R}^{M_1\times M_2}$.
Then, let's define the matrix version of the Jacobian as
$$J_f[m_1, m_2, n_1, n_2] = \frac{\partial f_{m_1,m_2}}{\partial x_{n_1,n_2}}$$so that, as in the vector case, the output indices come first and the input indices last. Then, the differential of $f$ is
$$\mathrm df = J_f \cdot_T \mathrm dx$$Note that the chain rule is also the same, except we naturally replace the matrix multiply with a tensordot, and further note that the shape of the differential is still the shape of the original function.
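To make the four-index Jacobian and the $\cdot_T$ contraction concrete, here is a small sketch (not from the original notebook) for the linear map $f(X) = AX$, whose Jacobian is $J[i,j,k,l] = A_{ik}\delta_{jl}$ with output indices first; `np.tensordot(..., axes=2)` plays the role of $\cdot_T$ here:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, m = 2, 3, 4
A = rng.standard_normal((p, n))
dX = 1e-3 * rng.standard_normal((n, m))

# Jacobian of f(X) = A X, laid out as J[i, j, k, l] = d f_ij / d X_kl = A_ik * delta_jl
J = np.einsum('ik,jl->ijkl', A, np.eye(m))

# df = J ._T dX : tensordot contracts the two trailing (input) axes of J with dX
df = np.tensordot(J, dX, axes=2)

print(np.max(np.abs(df - A @ dX)))         # zero up to floating-point error
```

Putting the output indices first is what lets the contraction consume exactly the input axes and leave a result with the shape of $f$.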
Differential Approach¶
The differential case for the softmax loss proceeds first by distributing the differential.
$$\mathrm dl = -y^T\mathrm d\theta^Tx + \mathrm d\left(\log(1^T\exp(\theta^Tx))\right)$$Then we use the chain rule.
$$\mathrm dl = -y^T\mathrm d\theta^Tx + \frac{\exp(\theta^Tx)^T}{1^T\exp(\theta^Tx)}\mathrm d\theta^Tx$$Lastly, we rearrange using the inner/outer tensordot property, followed by the transpose-invariant rule for tensordot, and then factor out $\mathrm d\theta$ (if these tensordot operations are new, they are quite simple and are explained in Part 1). This arrives at the same gradient:
$$\mathrm dl = \left(x\left(\frac{\exp(\theta^Tx)}{1^T\exp(\theta^Tx)} - y\right)^T\right)\cdot_T \mathrm d\theta$$
Note that we did not have to handle individual elements, rows, or columns case by case, and were able to compute the matrix derivative directly.
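As a numerical check of the result, the matrix gradient $x\left(\exp(\theta^Tx)/(1^T\exp(\theta^Tx)) - y\right)^T$ can be compared entry by entry against finite differences of the loss; the setup mirrors the earlier sketch and is again made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 5, 3
theta = rng.standard_normal((m, k))
x = rng.standard_normal(m)
y = np.eye(k)[1]

def softmax_loss(theta):
    logits = theta.T @ x
    return -y @ logits + np.log(np.sum(np.exp(logits)))

# gradient from the differential derivation: x (softmax(theta^T x) - y)^T
probs = np.exp(theta.T @ x) / np.sum(np.exp(theta.T @ x))
grad_analytic = np.outer(x, probs - y)                  # shape (m, k)

# finite-difference gradient, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(theta)
for i in range(m):
    for j in range(k):
        E = np.zeros_like(theta)
        E[i, j] = eps
        grad_fd[i, j] = (softmax_loss(theta + E) - softmax_loss(theta - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_fd)))          # close to zero
```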
Tensor Differential Calculus¶
We now have a very natural extension to tensor differential calculus. Consider a tensor valued function $f: \mathbb{R}^{N_1\times \dots \times N_k} \rightarrow \mathbb{R}^{M_1\times \dots \times M_l}$.
Then, let's define the tensor version of the Jacobian as
$$J_f[m_1, \dots, m_l, n_1, \dots, n_k] = \frac{\partial f_{m_1,\dots, m_l}}{\partial x_{n_1,\dots, n_k}}$$Then, the differential of $f$ is
$$\mathrm df = J_f \cdot_T \mathrm dx$$And that's it! We can now do tensor differential calculus using all the same rules as before.
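The same `np.tensordot` contraction stands in for $\cdot_T$ in higher orders as well. Below is a made-up example of a random linear map from 3-index tensors to 2-index tensors, whose Jacobian is the defining coefficient tensor itself, just to show that the shapes line up:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = (2, 3, 4), (5, 6)                   # input shape N1 x N2 x N3, output shape M1 x M2

# a random linear map f(x) = T ._T x; its Jacobian is T itself,
# laid out with the output indices first and the input indices last
T = rng.standard_normal(M + N)             # shape (5, 6, 2, 3, 4)
x = rng.standard_normal(N)
dx = 1e-3 * rng.standard_normal(N)

f = lambda x: np.tensordot(T, x, axes=len(N))

# df = J_f ._T dx; for a linear map this matches the exact change in f
df = np.tensordot(T, dx, axes=len(N))
print(df.shape)                                        # (5, 6), the shape of f(x)
print(np.max(np.abs((f(x + dx) - f(x)) - df)))         # floating-point noise only
```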
Returning to sub-matrix differentials¶
While performing full tensor differentials is often straightforward, it may still be desired to do elementwise or sub-matrix differential calculations. The differential form gives a very natural extension. For an arbitrary matrix valued function $f(x)$, suppose that $\mathrm df = J\cdot_T \mathrm dx$. Then
$$[\mathrm df]_{ij} = J_{ij}\cdot_T\mathrm dx = \mathrm df_{ij}$$So tensor differentials can be decomposed into their parts, or recomposed from them! The tensor version is identical, just with more indices.
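A final sketch of this decomposition, continuing the hypothetical $f(X) = AX$ example from earlier: contracting the $(i,j)$ slice of the Jacobian with $\mathrm dX$ recovers the $(i,j)$ entry of the full differential:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))
dX = rng.standard_normal((3, 4))

# Jacobian of f(X) = A X with output indices first: J[i, j, k, l] = A_ik * delta_jl
J = np.einsum('ik,jl->ijkl', A, np.eye(4))

dF = np.tensordot(J, dX, axes=2)           # full matrix differential, shape (2, 4)

i, j = 1, 2
dF_ij = np.tensordot(J[i, j], dX, axes=2)  # differential of the single entry f_ij
print(np.isclose(dF[i, j], dF_ij))         # True: the (i, j) part of dF is df_ij
```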