Introduction
There are several methods to calculate function gradients: symbolic differentiation, numerical differentiation, and automatic differentiation. Symbolic differentiation is used in software like Mathematica, Maple, and Maxima. However, it suffers from expression swell and requires closed-form mathematical functions, which makes it impractical for many engineering applications. Numerical differentiation works for all kinds of functions, but it only approximates the gradient. Automatic differentiation evaluates the gradient exactly, without requiring an analytical expression for it. It is also computationally efficient, requiring only one forward pass and one backward pass, and it is widely used inside modern deep learning frameworks like PyTorch and TensorFlow.
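As a quick sanity check of that claim, here is a minimal sketch (using JAX for brevity; the function `f` and step size `h` are arbitrary choices for illustration) comparing a finite-difference estimate against the exact derivative from automatic differentiation:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * jnp.exp(x)

x = 1.0
h = 1e-4  # step size for the finite-difference approximation

# Numerical differentiation: central difference, carries O(h^2) truncation error.
numerical = (f(x + h) - f(x - h)) / (2 * h)

# Automatic differentiation: exact up to floating-point rounding.
exact = jax.grad(f)(x)

print(numerical, exact)  # they agree to several digits, but only one is exact
```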
Automatic differentiation has two working modes: forward mode and reverse mode (a.k.a. backprop). Only reverse mode is widely known to practitioners, because it efficiently handles wide Jacobian matrices. Recently, a new gradient estimation method [1] was proposed that leverages the forward mode of automatic differentiation; it offers better memory efficiency compared with the backward mode. In this blog, I'd like to write down the detailed math used in both forward mode and reverse mode, so practitioners can have a clear picture of what's going on under the hood.
Setup
$$
x^{i+1} = L^i(x^{i})
$$
The neural network layer $L^i$ maps input $x^i \in \Reals^{m^{i}}$ to output $x^{i+1} \in \Reals^{m^{i+1}}$, where the superscript $i$ indicates the layer number. Layer $i$ takes input $x^i$ of dimension $m^{i}$; its output $x^{i+1}$, of dimension $m^{i+1}$, becomes the input of layer $i+1$. The Jacobian matrix $J^i \in \Reals^{m^{i+1} \times m^{i}}$ of the $i^{th}$ layer is:
$$
J^i =
\begin{bmatrix}
\frac{\partial x^{i+1}_1}{\partial x^{i}_1} & \frac{\partial x^{i+1}_1}{\partial x^{i}_2} & \cdots & \frac{\partial x^{i+1}_1}{\partial x^{i}_{m^{i}}} \\
\frac{\partial x^{i+1}_2}{\partial x^{i}_1} & \frac{\partial x^{i+1}_2}{\partial x^{i}_2} & \cdots & \frac{\partial x^{i+1}_2}{\partial x^{i}_{m^{i}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^{i}_1} & \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^{i}_2} & \cdots & \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^{i}_{m^{i}}}
\end{bmatrix}
$$
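To make the setup concrete, here is a small sketch (again assuming JAX; the layer, its `tanh` nonlinearity, and the dimensions $m^i = 3$, $m^{i+1} = 2$ are made up for illustration) that materializes $J^i$ for a single toy layer:

```python
import jax
import jax.numpy as jnp

# A toy layer L^i mapping R^3 -> R^2, with fixed weights.
W = jnp.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])

def layer(x):
    return jnp.tanh(W @ x)

x = jnp.array([0.1, 0.2, 0.3])

# jacfwd builds the Jacobian column by column (forward mode), jacrev builds it
# row by row (reverse mode); both return the same (2, 3) matrix J^i.
J_fwd = jax.jacfwd(layer)(x)
J_rev = jax.jacrev(layer)(x)
assert J_fwd.shape == (2, 3)
assert jnp.allclose(J_fwd, J_rev)
```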
Assume the first layer is $i=1$ and the last layer is $i=N$. Note that to calculate the gradient with respect to the layer parameters (weights $W^i$), we can modify this setup slightly: treat $W^i$ as the input and $x^{i+1}$ as the layer output, i.e. $x^{i+1} = L^i(W^i)$, so the chain simply starts at $W^i$; everything else is the same as the setup above. Without loss of generality, we will only consider the gradient with respect to the inputs $x^i$ in this blog.
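The same trick, sketched in code (assuming JAX, with illustrative shapes): hold $x^i$ fixed and differentiate the layer with respect to its weights $W^i$ instead:

```python
import jax
import jax.numpy as jnp

x = jnp.array([0.1, 0.2, 0.3])  # the layer input x^i, held fixed

# View the weights W^i as the "input" of the layer: x^{i+1} = L^i(W^i).
def layer_of_W(W):
    return jnp.tanh(W @ x)

W = jnp.ones((2, 3))

# The Jacobian w.r.t. the weights has shape (output dim,) + W.shape.
J_W = jax.jacrev(layer_of_W)(W)
assert J_W.shape == (2, 2, 3)
```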
Forward Mode
The Jacobian matrix $J^i \in \Reals^{m^{i+1} \times m^{i}}$ can be viewed as a linear map that converts a vector at a lower layer to a vector at a higher layer, which is usually known as the push-forward mapping. The input vector $v^i$ has dimension $m^{i}$, and it is a linear combination of the gradients of $x^i$ with respect to each $j^{th}$ input element $x^1_j$:
$$
v^i = c_1
\begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_1} \\ \frac{\partial x^i_2}{\partial x^1_1} \\ \frac{\partial x^i_3}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^i_{m^{i}}}{\partial x^1_1} \end{bmatrix}
+ c_2
\begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_2} \\ \frac{\partial x^i_2}{\partial x^1_2} \\ \frac{\partial x^i_3}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^i_{m^{i}}}{\partial x^1_2} \end{bmatrix}
+ \cdots +
c_{m^{1}}
\begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_{m^1}} \\ \frac{\partial x^i_2}{\partial x^1_{m^1}} \\ \frac{\partial x^i_3}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^i_{m^{i}}}{\partial x^1_{m^1}} \end{bmatrix}
$$
Applying the Jacobian $J^i$, we have
$$
\begin{split}
J^i v^i &= c_1 J^i
\begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_1} \\ \frac{\partial x^i_2}{\partial x^1_1} \\ \frac{\partial x^i_3}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^i_{m^{i}}}{\partial x^1_1} \end{bmatrix}
+ c_2 J^i
\begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_2} \\ \frac{\partial x^i_2}{\partial x^1_2} \\ \frac{\partial x^i_3}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^i_{m^{i}}}{\partial x^1_2} \end{bmatrix}
+ \cdots +
c_{m^{1}} J^i
\begin{bmatrix} \frac{\partial x^i_1}{\partial x^1_{m^1}} \\ \frac{\partial x^i_2}{\partial x^1_{m^1}} \\ \frac{\partial x^i_3}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^i_{m^{i}}}{\partial x^1_{m^1}} \end{bmatrix} \\
&= c_1
\begin{bmatrix} \frac{\partial x^{i+1}_1}{\partial x^1_1} \\ \frac{\partial x^{i+1}_2}{\partial x^1_1} \\ \frac{\partial x^{i+1}_3}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^1_1} \end{bmatrix}
+ c_2
\begin{bmatrix} \frac{\partial x^{i+1}_1}{\partial x^1_2} \\ \frac{\partial x^{i+1}_2}{\partial x^1_2} \\ \frac{\partial x^{i+1}_3}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^1_2} \end{bmatrix}
+ \cdots +
c_{m^{1}}
\begin{bmatrix} \frac{\partial x^{i+1}_1}{\partial x^1_{m^1}} \\ \frac{\partial x^{i+1}_2}{\partial x^1_{m^1}} \\ \frac{\partial x^{i+1}_3}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^{i+1}_{m^{i+1}}}{\partial x^1_{m^1}} \end{bmatrix} \\
&= v^{i+1}
\end{split}
$$
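This per-layer step $v^{i+1} = J^i v^i$ is exactly what a Jacobian-vector-product (JVP) primitive computes alongside the forward pass. A minimal sketch, assuming JAX and two made-up layers:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
W1 = jax.random.normal(key, (4, 3))  # layer 1: R^3 -> R^4
W2 = jax.random.normal(key, (2, 4))  # layer 2: R^4 -> R^2
layers = [lambda x: jnp.tanh(W1 @ x), lambda x: jnp.tanh(W2 @ x)]

x = jnp.array([0.1, 0.2, 0.3])  # x^1
v = jnp.array([1.0, 0.0, 0.0])  # v^1 with c_1 = 1: dx^1/dx^1_1 is the basis vector e_1

# Each jax.jvp call evaluates the layer and J^i v^i together,
# pushing (x^i, v^i) forward to (x^{i+1}, v^{i+1}).
for L in layers:
    x, v = jax.jvp(L, (x,), (v,))

# v is now v^{N+1}: the derivative of every network output w.r.t. x^1_1.
print(v)
```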
Applying this forward step layer by layer, after the last layer $N$ we get $v^{N+1}$:
$$
v^{N+1} = c_1
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_1} \\ \frac{\partial x^{N+1}_2}{\partial x^1_1} \\ \frac{\partial x^{N+1}_3}{\partial x^1_1} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_1} \end{bmatrix}
+ c_2
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_2} \\ \frac{\partial x^{N+1}_2}{\partial x^1_2} \\ \frac{\partial x^{N+1}_3}{\partial x^1_2} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_2} \end{bmatrix}
+ \cdots +
c_{m^{1}}
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^1_{m^1}} \\ \frac{\partial x^{N+1}_2}{\partial x^1_{m^1}} \\ \frac{\partial x^{N+1}_3}{\partial x^1_{m^1}} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^1_{m^1}} \end{bmatrix}
$$
In general, the forward mode evaluates a Jacobian-vector product $Jv$.
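Equivalently, a single JVP over the whole network gives $Jv$ directly, without ever forming $J$. A sketch assuming JAX (the network and the tangent vector $v$, i.e. the choice of coefficients $c_j$, are arbitrary):

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
W1 = jax.random.normal(key, (4, 3))
W2 = jax.random.normal(key, (2, 4))

def net(x):
    return jnp.tanh(W2 @ jnp.tanh(W1 @ x))

x = jnp.array([0.1, 0.2, 0.3])
v = jnp.array([0.5, -1.0, 2.0])  # any tangent vector (any choice of c_j)

_, Jv = jax.jvp(net, (x,), (v,))

# The same product computed naively through the full Jacobian:
J = jax.jacfwd(net)(x)
assert jnp.allclose(Jv, J @ v)
```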
Backward Mode
The transposed Jacobian matrix ${J^i}^{T} \in \Reals^{m^{i} \times m^{i+1}}$ can be viewed as a linear map that converts a vector at a higher layer to a vector at a lower layer, which is usually known as the pull-back mapping. The input vector $v^{i+1}$ has dimension $m^{i+1}$, and it is a linear combination of the gradients of each $j^{th}$ output element $x^{N+1}_j$ with respect to $x^{i+1}$:
$$
v^{i+1} = c_1
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ c_2
\begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ \cdots +
c_{m^{N+1}}
\begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
$$
Applying the transposed Jacobian ${J^i}^{T}$, we have
$$
\begin{split}
{J^i}^{T} v^{i+1} &= c_1 {J^i}^{T}
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ c_2 {J^i}^{T}
\begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix}
+ \cdots +
c_{m^{N+1}} {J^i}^{T}
\begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_2} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i+1}_{m^{i+1}}} \end{bmatrix} \\
&= c_1
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^{i}_1} \\ \frac{\partial x^{N+1}_1}{\partial x^{i}_2} \\ \frac{\partial x^{N+1}_1}{\partial x^{i}_3} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^{i}_{m^{i}}} \end{bmatrix}
+ c_2
\begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^{i}_1} \\ \frac{\partial x^{N+1}_2}{\partial x^{i}_2} \\ \frac{\partial x^{N+1}_2}{\partial x^{i}_3} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^{i}_{m^{i}}} \end{bmatrix}
+ \cdots +
c_{m^{N+1}}
\begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i}_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i}_2} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i}_3} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{i}_{m^{i}}} \end{bmatrix} \\
&= v^i
\end{split}
$$
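This per-layer step $v^i = {J^i}^{T} v^{i+1}$ is what a vector-Jacobian-product (VJP) primitive computes. A sketch assuming JAX and the same made-up two-layer network as before; note how the backward pass must keep the pull-back of every layer from the forward pass, which is where reverse mode spends its memory:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
W1 = jax.random.normal(key, (4, 3))
W2 = jax.random.normal(key, (2, 4))
layers = [lambda x: jnp.tanh(W1 @ x), lambda x: jnp.tanh(W2 @ x)]

# Forward pass, saving each layer's pull-back function (and its activations).
x = jnp.array([0.1, 0.2, 0.3])
pullbacks = []
for L in layers:
    x, vjp_fn = jax.vjp(L, x)
    pullbacks.append(vjp_fn)

# Backward pass: start from v^{N+1} and apply {J^i}^T layer by layer.
v = jnp.array([1.0, 0.0])  # c_1 = 1: picks out the first output x^{N+1}_1
for vjp_fn in reversed(pullbacks):
    (v,) = vjp_fn(v)

# v is now v^1: the gradient of x^{N+1}_1 w.r.t. the input x^1.
print(v)
```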
Applying this backward step layer by layer, back at the first layer $1$ we get $v^1$:
$$
v^1 = c_1
\begin{bmatrix} \frac{\partial x^{N+1}_1}{\partial x^{1}_1} \\ \frac{\partial x^{N+1}_1}{\partial x^{1}_2} \\ \frac{\partial x^{N+1}_1}{\partial x^{1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_1}{\partial x^{1}_{m^{1}}} \end{bmatrix}
+ c_2
\begin{bmatrix} \frac{\partial x^{N+1}_2}{\partial x^{1}_1} \\ \frac{\partial x^{N+1}_2}{\partial x^{1}_2} \\ \frac{\partial x^{N+1}_2}{\partial x^{1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_2}{\partial x^{1}_{m^{1}}} \end{bmatrix}
+ \cdots +
c_{m^{N+1}}
\begin{bmatrix} \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{1}_1} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{1}_2} \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{1}_3} \\ \vdots \\ \frac{\partial x^{N+1}_{m^{N+1}}}{\partial x^{1}_{m^{1}}} \end{bmatrix}
$$
In general, the backward mode evaluates a transposed Jacobian-vector product $J^T v$, which is sometimes written as a vector-Jacobian product $v^T J$.
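As with the forward mode, a single VJP over the whole network gives $J^T v$ without ever forming $J$; and when the output is a scalar loss ($m^{N+1} = 1$) with $v = 1$, this is exactly the gradient that backprop returns. A sketch assuming JAX:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
W1 = jax.random.normal(key, (4, 3))
W2 = jax.random.normal(key, (2, 4))

def net(x):
    return jnp.tanh(W2 @ jnp.tanh(W1 @ x))

x = jnp.array([0.1, 0.2, 0.3])
v = jnp.array([0.5, -1.0])  # any cotangent vector (any choice of c_j)

_, vjp_fn = jax.vjp(net, x)
(JTv,) = vjp_fn(v)

# The same product computed naively through the full Jacobian:
J = jax.jacrev(net)(x)
assert jnp.allclose(JTv, J.T @ v)

# Scalar special case: a scalar loss with v = 1 reproduces jax.grad.
loss = lambda x: jnp.sum(net(x))
_, vjp_fn = jax.vjp(loss, x)
(g,) = vjp_fn(jnp.array(1.0))
assert jnp.allclose(g, jax.grad(loss)(x))
```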
References
[1] Baydin, Atılım Güneş, et al. "Gradients without Backpropagation." arXiv preprint arXiv:2202.08587 (2022).