Forward and Backward Automatic Gradient Calculation

Introduction

There are several methods for calculating function gradients, such as symbolic differentiation, numerical differentiation, and automatic differentiation. Symbolic differentiation is used in software like Mathematica, Maple, and Maxima. However, it suffers from expression swell and requires closed-form mathematical expressions, which makes it impractical for many engineering applications. Numerical differentiation works for all kinds of functions, but it only estimates the gradient approximately. Automatic differentiation evaluates the gradient exactly, even though it never forms an analytical expression. It is also computationally efficient, requiring only one forward pass and one backward pass, and it is widely used inside modern deep learning frameworks such as PyTorch and TensorFlow.
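
To make the contrast concrete, here is a minimal sketch in PyTorch (the test function and the step size h are arbitrary illustrative choices): a central finite difference only approximates the derivative, while automatic differentiation returns it exactly up to floating-point error.

```python
import torch

# f(x) = sin(x^2): an arbitrary smooth test function.
def f(x):
    return torch.sin(x ** 2)

x = torch.tensor(1.5, requires_grad=True)

# Numerical differentiation: central difference, accurate only to O(h^2).
h = 1e-4
numerical = (f(x.detach() + h) - f(x.detach() - h)) / (2 * h)

# Automatic differentiation (reverse mode): exact up to floating point.
f(x).backward()
exact = x.grad

print(numerical.item(), exact.item())  # close, but not identical
```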

Automatic differentiation has two working modes: forward mode and reverse mode (a.k.a. backprop). Only reverse mode is widely known to practitioners, because it is efficient for the wide Jacobian matrices that arise when a scalar loss depends on many inputs. Recently, a new gradient estimation method [1] was proposed that leverages the forward mode of automatic differentiation; it has better memory efficiency than the backward mode. In this blog, I'd like to write down the detailed math formulas used in both forward mode and reverse mode, so practitioners can have a clear picture of what's going on under the hood.

Setup

$$x^{i+1} = L^i(x^i)$$

The neural network layer $L^i$ maps input $x^i \in \mathbb{R}^{m^i}$ to output $x^{i+1} \in \mathbb{R}^{m^{i+1}}$, where the superscript $i$ indicates the layer number. Layer $i$ has input $x^i$ of dimension $m^i$; its output $x^{i+1}$, of dimension $m^{i+1}$, is the input to layer $i+1$. The Jacobian matrix $J^i \in \mathbb{R}^{m^{i+1} \times m^i}$ of the $i^{th}$ layer is:

$$
J^i = \begin{bmatrix}
\frac{\partial x^{i+1}_1}{\partial x^i_1} & \frac{\partial x^{i+1}_1}{\partial x^i_2} & \dots & \frac{\partial x^{i+1}_1}{\partial x^i_{m^i}} \\
\frac{\partial x^{i+1}_2}{\partial x^i_1} & \frac{\partial x^{i+1}_2}{\partial x^i_2} & \dots & \frac{\partial x^{i+1}_2}{\partial x^i_{m^i}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^i_1} & \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^i_2} & \dots & \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^i_{m^i}}
\end{bmatrix}
$$

Assume the first layer is $i=1$ and the last layer is $i=N$. Note that to calculate the gradient with respect to the layer parameters (the weights $W^i$), we can modify this setup slightly: treat $W^i$ as the input of layer $i$ and keep $x^{i+1}$ as its output, i.e. $x^{i+1} = L^i(W^i)$. With the chain starting from $W^i$ instead of $x^1$, everything else is the same as the setup above. Without loss of generality, we will only consider the gradient with respect to the inputs $x^i$ in this blog.
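
As a quick sanity check of this setup, the sketch below builds a toy layer $L^i: \mathbb{R}^3 \to \mathbb{R}^2$ (a tanh of a linear map, chosen arbitrarily for illustration) and materializes its Jacobian with torch.autograd.functional.jacobian; the resulting shape matches $m^{i+1} \times m^i$.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

# A toy layer L^i: R^3 -> R^2, tanh of a linear map; the weights are held
# fixed because here we only care about the Jacobian w.r.t. the input x^i.
W = torch.randn(2, 3)
b = torch.randn(2)

def layer(x):
    return torch.tanh(W @ x + b)

x_i = torch.randn(3)
J_i = jacobian(layer, x_i)

print(J_i.shape)  # torch.Size([2, 3]) == (m^{i+1}, m^i)
```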

Forward Mode

The Jacobian matrix $J^i \in \mathbb{R}^{m^{i+1} \times m^i}$ can be viewed as a linear map that converts a vector in a lower layer to a vector in a higher layer, usually known as the push-forward mapping. The input vector $v^i$ has dimension $m^i$ and is a linear combination of the gradients of $x^i$ with respect to the input elements $x^1_j$:

$$
v^i = c_1 \begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_1} \\ \frac{\partial x^i_2}{\partial x^1_1} \\ \frac{\partial x^i_3}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^i_{m^i}}{\partial x^1_1} \end{bmatrix}
+ c_2 \begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_2} \\ \frac{\partial x^i_2}{\partial x^1_2} \\ \frac{\partial x^i_3}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^i_{m^i}}{\partial x^1_2} \end{bmatrix}
+ \dots
+ c_{m^1} \begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_{m^1}} \\ \frac{\partial x^i_2}{\partial x^1_{m^1}} \\ \frac{\partial x^i_3}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^i_{m^i}}{\partial x^1_{m^1}} \end{bmatrix}
$$

Applying the Jacobian $J^i$, we have

$$
\begin{split}
J^i v^i &= c_1 J^i \begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_1} \\ \frac{\partial x^i_2}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^i_{m^i}}{\partial x^1_1} \end{bmatrix}
+ c_2 J^i \begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_2} \\ \frac{\partial x^i_2}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^i_{m^i}}{\partial x^1_2} \end{bmatrix}
+ \dots
+ c_{m^1} J^i \begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_{m^1}} \\ \frac{\partial x^i_2}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^i_{m^i}}{\partial x^1_{m^1}} \end{bmatrix} \\
&= c_1 \begin{bmatrix} \frac{\partial x^{i+1}_1}{\partial x^1_1} \\ \frac{\partial x^{i+1}_2}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^1_1} \end{bmatrix}
+ c_2 \begin{bmatrix} \frac{\partial x^{i+1}_1}{\partial x^1_2} \\ \frac{\partial x^{i+1}_2}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^1_2} \end{bmatrix}
+ \dots
+ c_{m^1} \begin{bmatrix} \frac{\partial x^{i+1}_1}{\partial x^1_{m^1}} \\ \frac{\partial x^{i+1}_2}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^1_{m^1}} \end{bmatrix} \\
&= v^{i+1}
\end{split}
$$
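
The identity $J^i v^i = v^{i+1}$ can be checked numerically for a single toy layer. The sketch below (layer sizes and the tangent vector are arbitrary illustrative choices) compares an explicit matrix-vector product with the materialized Jacobian against torch.autograd.functional.jvp, which pushes the tangent forward without ever forming $J^i$.

```python
import torch
from torch.autograd.functional import jacobian, jvp

torch.manual_seed(0)

# Toy layer L^i: R^3 -> R^2, as in the Setup sketch.
W = torch.randn(2, 3)
b = torch.randn(2)

def layer(x):
    return torch.tanh(W @ x + b)

x_i = torch.randn(3)   # the point x^i at which the Jacobian is taken
v_i = torch.randn(3)   # an arbitrary tangent vector v^i

# Explicit push-forward: materialize J^i, then multiply.
v_next_explicit = jacobian(layer, x_i) @ v_i

# Forward-mode push-forward: J^i v^i without forming J^i.
_, v_next_jvp = jvp(layer, x_i, v_i)

print(torch.allclose(v_next_explicit, v_next_jvp, atol=1e-6))  # True
```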

Applying this forward step layer by layer, after the last layer $N$ we get $v^{N+1}$:

$$
v^{N+1} = c_1 \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_1} \\ \frac{\partial x^{N+1}_2}{\partial x^1_1} \\ \frac{\partial x^{N+1}_3}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_1} \end{bmatrix}
+ c_2 \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_2} \\ \frac{\partial x^{N+1}_2}{\partial x^1_2} \\ \frac{\partial x^{N+1}_3}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_2} \end{bmatrix}
+ \dots
+ c_{m^1} \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_{m^1}} \\ \frac{\partial x^{N+1}_2}{\partial x^1_{m^1}} \\ \frac{\partial x^{N+1}_3}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_{m^1}} \end{bmatrix}
$$

In general, the forward mode evaluates a Jacobian-vector product $Jv$.
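
Putting the pieces together, the following sketch (again with arbitrary toy layers) pushes a seed tangent $v^1$ through two layers one jvp at a time and checks that the result matches a single end-to-end Jacobian-vector product of the composed network.

```python
import torch
from torch.autograd.functional import jvp

torch.manual_seed(0)

# Two toy layers: L^1: R^4 -> R^3 and L^2: R^3 -> R^2.
W1, b1 = torch.randn(3, 4), torch.randn(3)
W2, b2 = torch.randn(2, 3), torch.randn(2)

layer1 = lambda x: torch.tanh(W1 @ x + b1)
layer2 = lambda x: torch.tanh(W2 @ x + b2)
network = lambda x: layer2(layer1(x))

x1 = torch.randn(4)    # input x^1
v1 = torch.randn(4)    # seed tangent v^1 (the coefficients c_j)

# Push the tangent forward one layer at a time: v^{i+1} = J^i v^i.
x2, v2 = jvp(layer1, x1, v1)
x3, v3 = jvp(layer2, x2, v2)

# One end-to-end Jacobian-vector product of the composed network.
_, v3_direct = jvp(network, x1, v1)

print(torch.allclose(v3, v3_direct, atol=1e-6))  # True
```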

Backward Mode

The transposed Jacobian matrix ${J^i}^T \in \mathbb{R}^{m^i \times m^{i+1}}$ can be viewed as a linear map that converts a vector in a higher layer to a vector in a lower layer, usually known as the pull-back mapping. The input vector $v^{i+1}$ has dimension $m^{i+1}$ and is a linear combination of the gradients of the output elements $x^{N+1}_j$ with respect to $x^{i+1}$:

$$
v^{i+1} = c_1 \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ c_2 \begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ \dots
+ c_{m^{N+1}} \begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
$$

Applying the transposed Jacobian ${J^i}^T$, we have

$$
\begin{split}
{J^i}^T v^{i+1} &= c_1 {J^i}^T \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_2} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ c_2 {J^i}^T \begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_2} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ \dots
+ c_{m^{N+1}} {J^i}^T \begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_2} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix} \\
&= c_1 \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^i_1} \\ \frac{\partial x^{N+1}_1}{\partial x^i_2} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^i_{m^i}} \end{bmatrix}
+ c_2 \begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^i_1} \\ \frac{\partial x^{N+1}_2}{\partial x^i_2} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^i_{m^i}} \end{bmatrix}
+ \dots
+ c_{m^{N+1}} \begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^i_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^i_2} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^i_{m^i}} \end{bmatrix} \\
&= v^i
\end{split}
$$
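
The per-layer identity ${J^i}^T v^{i+1} = v^i$ can be verified the same way. The sketch below (same toy layer style as before) compares an explicit product with the transposed Jacobian against torch.autograd.functional.vjp, which pulls the cotangent back without materializing $J^i$.

```python
import torch
from torch.autograd.functional import jacobian, vjp

torch.manual_seed(0)

# Toy layer L^i: R^3 -> R^2, as before.
W = torch.randn(2, 3)
b = torch.randn(2)

def layer(x):
    return torch.tanh(W @ x + b)

x_i = torch.randn(3)   # the point x^i at which the Jacobian is taken
u = torch.randn(2)     # an arbitrary cotangent vector v^{i+1}

# Explicit pull-back: materialize J^i, then multiply by its transpose.
v_prev_explicit = jacobian(layer, x_i).T @ u

# Reverse-mode pull-back: (J^i)^T v^{i+1} without forming J^i.
_, v_prev_vjp = vjp(layer, x_i, u)

print(torch.allclose(v_prev_explicit, v_prev_vjp, atol=1e-6))  # True
```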

Applying this backward step layer by layer, after reaching the first layer we get $v^1$:

$$
v^1 = c_1 \begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_1} \\ \frac{\partial x^{N+1}_1}{\partial x^1_2} \\ \frac{\partial x^{N+1}_1}{\partial x^1_3} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^1_{m^1}} \end{bmatrix}
+ c_2 \begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^1_1} \\ \frac{\partial x^{N+1}_2}{\partial x^1_2} \\ \frac{\partial x^{N+1}_2}{\partial x^1_3} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^1_{m^1}} \end{bmatrix}
+ \dots
+ c_{m^{N+1}} \begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_2} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_3} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_{m^1}} \end{bmatrix}
$$

In general, the backward mode evaluates a transposed Jacobian-vector product $J^T v$, sometimes written as $v^T J$.
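
To close the loop, the sketch below (arbitrary toy layers again) pulls a seed cotangent $v^{N+1}$ back through two layers one vjp at a time, which is what backprop does under the hood, and checks the result against $J^T v$ computed from the full Jacobian of the composed network.

```python
import torch
from torch.autograd.functional import jacobian, vjp

torch.manual_seed(0)

# Two toy layers: L^1: R^4 -> R^3 and L^2: R^3 -> R^2.
W1, b1 = torch.randn(3, 4), torch.randn(3)
W2, b2 = torch.randn(2, 3), torch.randn(2)

layer1 = lambda x: torch.tanh(W1 @ x + b1)
layer2 = lambda x: torch.tanh(W2 @ x + b2)
network = lambda x: layer2(layer1(x))

x1 = torch.randn(4)
x2 = layer1(x1)        # the forward pass stores the intermediate x^2

u = torch.randn(2)     # seed cotangent v^{N+1} at the network output

# Pull the cotangent back one layer at a time: v^i = (J^i)^T v^{i+1}.
_, v2 = vjp(layer2, x2, u)
_, v1 = vjp(layer1, x1, v2)

# Compare against J^T v built from the full Jacobian of the composition.
J = jacobian(network, x1)                      # shape (2, 4)
print(torch.allclose(v1, J.T @ u, atol=1e-6))  # True
```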

References

  1. Baydin, Atılım Güneş, et al. "Gradients without Backpropagation." arXiv preprint arXiv:2202.08587 (2022).
Written on July 14, 2022