The paper Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer [1] introduces a practical way to parameterize a deep neural network so that hyperparameters transfer zero-shot across width and depth, while the network still does maximal feature learning [2]. The paper shows mathematically that the network output and the weight updates won't blow up (independent of the network width) if the maximal update parameterization is used. Though the paper is nicely written, with a lot of detail in the appendix, I found it still omits many essential computation details. As my paper reading notes, I try to fill in those detailed derivations in this blog for future reference. Make sure you read the paper before checking this blog; otherwise it won't make any sense. Hopefully it helps you understand the paper better.
Background: Matrix Differentiation
Linear Case
For matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{n\times k}$, define $C=AB\in\mathbb{R}^{m\times k}$. Suppose we know the gradient of the loss with respect to the output $C$, stored in the transposed shape $\frac{\partial L}{\partial C}\in\mathbb{R}^{k\times m}$. The gradients with respect to $B$ and $A$ are

$$\frac{\partial L}{\partial B}=\frac{\partial L}{\partial C}A \tag{0.1}$$

$$\frac{\partial L}{\partial A}=B\frac{\partial L}{\partial C} \tag{0.2}$$
Nonlinear Case
For matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{n\times k}$, define $C=\sigma(AB)\in\mathbb{R}^{m\times k}$, where $\sigma(\cdot)$ is an element-wise non-linear function. If we know the gradient of the loss with respect to the output, $\frac{\partial L}{\partial C}\in\mathbb{R}^{k\times m}$, the gradients with respect to $B$ and $A$ are

$$\frac{\partial L}{\partial B}=\left(\frac{\partial L}{\partial C}\odot\sigma'^T(AB)\right)A \tag{0.8}$$

$$\frac{\partial L}{\partial A}=B\left(\frac{\partial L}{\partial C}\odot\sigma'^T(AB)\right) \tag{0.9}$$

where $\odot$ denotes element-wise (Hadamard) multiplication.
The proof of Eq. (0.9) is similar to the proof of Eq. (0.8).
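To make the convention concrete, here is a minimal NumPy sketch (my own addition, not from the paper) that checks Eq. (0.8) and Eq. (0.9) against finite differences; it assumes the gradient of a matrix is stored in the transposed shape, as above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 3, 4, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, k))

sigma = np.tanh                        # element-wise non-linearity
dsigma = lambda z: 1 - np.tanh(z) ** 2  # its derivative

def loss(A, B):
    # Scalar loss L = sum(sigma(AB)), so dL/dC is all ones.
    return sigma(A @ B).sum()

dL_dC = np.ones((k, m))  # dL/dC stored in the transposed (k x m) convention

# Eq. (0.8): dL/dB = (dL/dC ⊙ σ'(AB)^T) A   -> shape k x n (transpose of B)
dL_dB = (dL_dC * dsigma(A @ B).T) @ A
# Eq. (0.9): dL/dA = B (dL/dC ⊙ σ'(AB)^T)   -> shape n x m (transpose of A)
dL_dA = B @ (dL_dC * dsigma(A @ B).T)

# Finite-difference check of a few entries (note the transposed indexing).
eps = 1e-6
for (i, j) in [(0, 0), (1, 2), (2, 3)]:
    Bp = B.copy(); Bp[i, j] += eps
    num = (loss(A, Bp) - loss(A, B)) / eps
    assert np.isclose(num, dL_dB[j, i], atol=1e-4)
    Ap = A.copy(); Ap[i % m, j] += eps
    num = (loss(Ap, B) - loss(A, B)) / eps
    assert np.isclose(num, dL_dA[j, i % m], atol=1e-4)
print("Eq. (0.8) and Eq. (0.9) match finite differences.")
```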
Forward Pass
Let's assume we have an MLP neural network with an input layer, one hidden layer, and an output layer (three weight matrices in total).
The input is a column vector $x_0\in\mathbb{R}^{v\times 1}$, where $v$ is a finite dimension, e.g. the vocabulary size.
The input layer maps the finite dimension $v$ to the (infinite-width) dimension $d$ by the matrix $W_1\in\mathbb{R}^{d\times v}$.
Assume $\sigma(\cdot)$ is a point-wise non-linear activation function applied coordinate-wise, so $\sigma(W_1x_0)\in\mathbb{R}^{d\times 1}$.
The output $x_1$ is defined as

$$x_1=\sigma(W_1x_0) \tag{1}$$
Similarly, the hidden-layer weight $W_2\in\mathbb{R}^{d\times d}$ transforms the input $x_1$ into the output $x_2$ by

$$x_2=\sigma(W_2x_1) \tag{2}$$
Lastly, the output-layer matrix $W_3\in\mathbb{R}^{v\times d}$ converts the infinite dimension $d$ back into the finite $v$:

$$x_3=\sigma(W_3x_2) \tag{3}$$
Add the MSE loss layer as

$$L=\lVert x_3-y\rVert^2 \tag{4}$$
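Here is a minimal NumPy sketch of this forward pass (Eqs. 1–4). The concrete sizes $v$ and $d$ are arbitrary illustrative values, and the initialization scales anticipate the ones derived in the backward-pass section below.

```python
import numpy as np

rng = np.random.default_rng(0)
v, d = 10, 512            # finite vocab dimension v, (large) width d
sigma = np.tanh           # point-wise non-linearity

x0 = rng.standard_normal((v, 1))               # input, Theta(1) coordinates
W1 = rng.standard_normal((d, v))               # input layer: Theta(1) coordinates (finite fan-in)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)  # hidden layer: Theta(1/sqrt(d)) coordinates
W3 = rng.standard_normal((v, d)) / d           # output layer: Theta(1/d) coordinates
y = rng.standard_normal((v, 1))                # target

x1 = sigma(W1 @ x0)                 # Eq. (1)
x2 = sigma(W2 @ x1)                 # Eq. (2)
x3 = sigma(W3 @ x2)                 # Eq. (3)
L = float(np.sum((x3 - y) ** 2))    # Eq. (4), MSE loss
print(L, np.abs(x1).mean(), np.abs(x2).mean(), np.abs(x3).mean())
```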
Backward Pass
Starting from the loss $L$, we calculate $\frac{\partial L}{\partial x_3}\in\mathbb{R}^{1\times v}$ as:

$$\frac{\partial L}{\partial x_3}=2(x_3-y)^T \tag{5}$$
Following the 3 desiderata stated by Yang et al. [1], i.e.

1. Every (pre)activation vector in a network should have $\Theta(1)$-sized coordinates.
2. The neural network output should be $O(1)$.
3. All parameters should be updated as much as possible (in terms of scaling in width) without leading to divergence.
Since $x_0,x_3\in\mathbb{R}^{v\times 1}$ and $x_1,x_2\in\mathbb{R}^{d\times 1}$ all have $\Theta(1)$-sized coordinates, we can see $\frac{\partial L}{\partial x_3}$ has coordinate size $\Theta(1)$.
From Yang et al. [1] we know the output-layer weight $W_3$ should have coordinate size $\Theta(1/d)$ to make $x_3$ of size $\Theta(1)$. This is because $x_2$ and $W_3$ become correlated during training and $x_2$ has $\Theta(1)$ coordinates; by the law of large numbers, the sum of $d$ such $\Theta(1/d)$ terms gives $W_3x_2$ coordinate size $\Theta(1)$.
Calculate $\frac{\partial L}{\partial W_3}\in\mathbb{R}^{d\times v}$ according to Eq. (0.9):

$$\frac{\partial L}{\partial W_3}=x_2\left(2(x_3-y)^T\odot\sigma'^T(W_3x_2)\right) \tag{6}$$

We define the vector $c_2=2(x_3-y)\odot\sigma'(W_3x_2)\in\mathbb{R}^{v\times 1}$; it is easy to see $c_2$ has coordinate size $\Theta(1)$. Then $\frac{\partial L}{\partial W_3}$ simplifies to

$$\frac{\partial L}{\partial W_3}=x_2c_2^T \tag{7}$$
According to the 3 desiderata, we want $(W_3-\eta\Delta W_3)x_2$ to have coordinate size $\Theta(1)$, which implies both $W_3x_2$ and $\eta\Delta W_3x_2$ have coordinate size $\Theta(1)$. So $W_3$ has to be initialized with coordinate size $\Theta(1/d)$. For the update term, $\eta\Delta W_3x_2=\eta c_2x_2^Tx_2$, where the $x_2^T$ comes from the previous step. Since $x_2^Tx_2$ sums $d$ correlated $\Theta(1)$ terms, it is of size $\Theta(d)$. For SGD, we therefore need to set $\eta=1/d$, so that $x_2^Tx_2/d$ is of size $\Theta(1)$. For Adam, the learning rate also needs a constant $1/d$: the normalization makes the coordinates of $\Delta W_3$ size $\Theta(1)$, and $\Delta W_3x_2$ still sums $d$ correlated terms.
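A quick numeric illustration (mine, not from the paper) of why the output-layer SGD learning rate must scale as $1/d$: the update contribution $\eta\Delta W_3x_2=\eta c_2(x_2^Tx_2)$ grows like $\Theta(d)$ unless $\eta\propto 1/d$. The vectors below are random stand-ins with the coordinate sizes assumed above.

```python
import numpy as np

rng = np.random.default_rng(0)
v = 10
for d in [256, 1024, 4096, 16384]:
    x2 = rng.standard_normal((d, 1))   # Theta(1) coordinates
    c2 = rng.standard_normal((v, 1))   # Theta(1) coordinates (finite v)
    dW3 = c2 @ x2.T                    # Delta W3 = c2 x2^T (transpose of Eq. 7)
    raw = np.abs(dW3 @ x2).mean()                   # grows like Theta(d)
    scaled = np.abs((1.0 / d) * dW3 @ x2).mean()    # eta = 1/d keeps it Theta(1)
    print(f"d={d:6d}  |dW3 @ x2| ~ {raw:10.1f}   |(1/d) * dW3 @ x2| ~ {scaled:.3f}")
```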
Calculate $\frac{\partial L}{\partial x_2}\in\mathbb{R}^{1\times d}$ according to Eq. (0.8):

$$\frac{\partial L}{\partial x_2}=\left(\frac{\partial L}{\partial x_3}\odot\sigma'^T(W_3x_2)\right)W_3=c_2^TW_3 \tag{8}$$
Since $v$ is finite and $W_3$ has coordinate size $\Theta(1/d)$, $\frac{\partial L}{\partial x_2}$ has coordinate size $\Theta(1/d)$.
We define the vector $c_1=(c_2^TW_3)^T\odot\sigma'(W_2x_1)\in\mathbb{R}^{d\times 1}$, which has coordinate size $\Theta(1/d)$. Analogous to Eq. (7), the gradient of the hidden weight is

$$\frac{\partial L}{\partial W_2}=x_1c_1^T \tag{9}$$
According to the 3 desiderata, we want $(W_2-\eta\Delta W_2)x_1$ to have coordinate size $\Theta(1)$, which implies both $W_2x_1$ and $\eta\Delta W_2x_1$ have coordinate size $\Theta(1)$. Since $W_2$ is independent of $x_1$ at initialization, the central limit theorem (rather than the law of large numbers) applies, so $W_2$ has to be initialized with coordinate size $\Theta(1/\sqrt{d})$, i.e. variance $\Theta(1/d)$. For the update term, $\eta\Delta W_2x_1=\eta c_1x_1^Tx_1$, where the $x_1^T$ comes from the previous step. For SGD, we need to set $\eta=1$: $c_1x_1^Tx_1$ already has coordinate size $\Theta(1)$ because $c_1$ is of order $\Theta(1/d)$ and $x_1^Tx_1$ is of order $\Theta(d)$. For Adam, the learning rate needs a constant $1/d$, because the $\Theta(1/d)$ factor from $c_1$ in the gradient is canceled out by the normalization, leaving the $\Theta(d)$ factor from $x_1^Tx_1$.
According to the 3 desiderata, we want $(W_1-\eta\Delta W_1)x_0$ to have coordinate size $\Theta(1)$, which implies both $W_1x_0$ and $\eta\Delta W_1x_0$ have coordinate size $\Theta(1)$. So $W_1$ has to be initialized with coordinate size $\Theta(1)$ because of the finite $x_0$ dimension. For the update term, $\eta\Delta W_1x_0=\eta\left(W_2^Tc_1\odot\sigma'(W_1x_0)\right)x_0^Tx_0$, where the $x_0^T$ comes from the previous step. $x_0^Tx_0$ is of order $\Theta(1)$ because of the finite $v$ dimension, while $W_2^Tc_1$ is of order $\Theta(1/d)$. To counter the extra $\Theta(1/d)$, for SGD we need to set $\eta=d$. For Adam, the learning rate is a constant, because the $\Theta(1/d)$ factor in the gradient is canceled out by the normalization.
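Putting the three rules together, here is a hedged NumPy sketch (my own summary, not the paper's code) that runs one forward/backward pass of the 3-layer MLP at several widths and checks that, with the initialization scales and per-layer SGD learning rates derived above (input $\propto d$, hidden $\propto 1$, output $\propto 1/d$), the update-induced change of each pre-activation stays roughly $\Theta(1)$ as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
v = 10
sigma = np.tanh
dsigma = lambda z: 1 - np.tanh(z) ** 2

print("    d   |eta1*dW1@x0|  |eta2*dW2@x1|  |eta3*dW3@x2|")
for d in [256, 1024, 4096]:
    # Initialization scales derived above: Theta(1), Theta(1/sqrt(d)), Theta(1/d) coordinates.
    W1 = rng.standard_normal((d, v))
    W2 = rng.standard_normal((d, d)) / np.sqrt(d)
    W3 = rng.standard_normal((v, d)) / d
    x0 = rng.standard_normal((v, 1))
    y = rng.standard_normal((v, 1))

    # Forward pass, Eqs. (1)-(4).
    z1 = W1 @ x0; x1 = sigma(z1)
    z2 = W2 @ x1; x2 = sigma(z2)
    z3 = W3 @ x2; x3 = sigma(z3)

    # Backward pass, following Eqs. (5)-(9) and the input-layer analogue.
    c2 = 2 * (x3 - y) * dsigma(z3)   # Theta(1)
    c1 = (W3.T @ c2) * dsigma(z2)    # Theta(1/d)
    c0 = (W2.T @ c1) * dsigma(z1)    # Theta(1/d)
    dW3, dW2, dW1 = c2 @ x2.T, c1 @ x1.T, c0 @ x0.T

    # Per-layer SGD learning rates derived above: input ~ d, hidden ~ 1, output ~ 1/d.
    eta1, eta2, eta3 = float(d), 1.0, 1.0 / d
    u1 = np.abs(eta1 * dW1 @ x0).mean()
    u2 = np.abs(eta2 * dW2 @ x1).mean()
    u3 = np.abs(eta3 * dW3 @ x2).mean()
    print(f"{d:5d}   {u1:12.4f}  {u2:12.4f}  {u3:12.4f}")
```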
To see why the $x_2^{t-1}$ at step $t-1$ correlates with the $x_2^t$ at step $t$, calculate $x_2^t$ by the forward pass Eq. (2) and note that the weight matrix is $W_2^t=W_2^{t-1}-\eta c_1^{t-1}x_1^{t-1,T}$ according to Eq. (9).
$$x_2^t=\sigma(W_2^tx_1^t)=\sigma\!\left((W_2^{t-1}-\eta c_1^{t-1}x_1^{t-1,T})x_1^t\right)$$
And by the forward pass Eq. (2),

$$x_2^{t-1}=\sigma(W_2^{t-1}x_1^{t-1})$$
It is clear that both $x_2^t$ and $x_2^{t-1}$ depend on the same term $x_1^{t-1}$, so they are positively correlated.
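A small numeric illustration of this correlation (my own, under the simplifying assumption that the same batch, i.e. the same $x_1$, is used at steps $t-1$ and $t$, and with a random stand-in for the $\Theta(1/d)$ backward signal $c_1$): the inner product $x_2^{t-1,T}x_2^t$ grows like $\Theta(d)$, as the law-of-large-numbers argument for the $W_3$ update requires, while the inner product with an independent vector of the same coordinate size only grows like $\Theta(\sqrt{d})$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh
eta = 1.0                                        # hidden-layer SGD learning rate
for d in [256, 1024, 4096]:
    x1 = sigma(rng.standard_normal((d, 1)))      # Theta(1) coordinates, fixed batch
    W2_prev = rng.standard_normal((d, d)) / np.sqrt(d)
    c1_prev = rng.standard_normal((d, 1)) / d    # stand-in for the Theta(1/d) backward signal
    W2_cur = W2_prev - eta * c1_prev @ x1.T      # Eq. (9)-style update
    x2_prev = sigma(W2_prev @ x1)
    x2_cur = sigma(W2_cur @ x1)
    z = sigma(rng.standard_normal((d, 1)))       # independent vector, same coordinate size
    corr = (x2_prev.T @ x2_cur).item() / d       # stays Theta(1): correlated, LLN
    indep = (x2_prev.T @ z).item() / d           # shrinks like 1/sqrt(d): independent, CLT
    print(f"d={d:5d}  x2_prev.x2_cur/d = {corr:.3f}   x2_prev.z/d = {indep:.4f}")
```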
Self-attention Layer
From the example above, we can see that the gradient with respect to a layer's output has coordinate size $\Theta(1/d)$, because of the $\Theta(1/d)$ factor passed down from the output layer in the backward pass. So $\frac{\partial L}{\partial O}$ has coordinate size $\Theta(1/d)$.
Ignoring multiple heads, assume the input is $x\in\mathbb{R}^{s\times d}$, where $s$ is the finite sequence length, and the matrices $K,Q,V\in\mathbb{R}^{d\times d}$. The self-attention layer is

$$O=\sigma\!\left(\frac{xQK^Tx^T}{d}\right)xV$$

Note that here we scale the attention score by $1/d$ (instead of the usual $1/\sqrt{d}$); as shown below, this is necessary.
Calculate $\frac{\partial L}{\partial \sigma}$ according to Eq. (0.2), where $\sigma$ here denotes the attention-score matrix $\sigma\!\left(\frac{xQK^Tx^T}{d}\right)$:

$$\frac{\partial L}{\partial \sigma}=xV\frac{\partial L}{\partial O} \tag{12}$$
Calculate $\frac{\partial L}{\partial (xQ)}$ according to Eq. (0.9):

$$\frac{\partial L}{\partial (xQ)}=\frac{K^Tx^T}{d}\left(\frac{\partial L}{\partial \sigma}\odot\sigma'^T\right) \tag{13}$$
Calculate $\frac{\partial L}{\partial Q}$ according to Eq. (0.1) and substitute Eq. (12) and Eq. (13):

$$\frac{\partial L}{\partial Q}=\frac{\partial L}{\partial (xQ)}\,x=\frac{K^Tx^T}{d}\left(xV\frac{\partial L}{\partial O}\odot\sigma'^T\right)x \tag{14}$$
According to the 3 desiderata, we want $x(Q-\eta\Delta Q)$ to have coordinate size $\Theta(1)$, which implies both $xQ$ and $\eta x\Delta Q$ have coordinate size $\Theta(1)$. Since $Q$ is independent of $x$ at initialization, $Q$ has to be initialized with coordinate size $\Theta(1/\sqrt{d})$, just like a hidden weight matrix. For the update term, transposing Eq. (14) gives

$$\eta x\Delta Q=\eta\,\frac{xx^T}{d}\left(\frac{\partial L}{\partial O}^TV^Tx^T\odot\sigma'\right)xK$$

Here $\frac{xx^T}{d}$ is of order $\Theta(1)$ by the law of large numbers, $xK$ and $\sigma'$ are of order $\Theta(1)$, and the $\Theta(1/d)$ coordinates of $\frac{\partial L}{\partial O}$ (as discussed above) are compensated by the sum over $d$ correlated terms against $V^Tx^T$; all remaining matrix products only sum over the finite $s$ dimension. So this is the same case as the hidden layer: for SGD we set $\eta=1$, and for Adam the learning rate is a constant $1/d$ because the $\Theta(1/d)$ factor in the gradient is canceled out by the normalization. Note, as before, one of the $x$ factors comes from the previous step.
The $\frac{\partial L}{\partial K}$ and $\frac{\partial L}{\partial V}$ cases are analogous to the $\frac{\partial L}{\partial Q}$ case, so the hidden-layer parameterization rules apply to $K$ and $V$ as well.
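For concreteness, here is a minimal NumPy sketch (not the paper's code) of a single-head self-attention layer with the $1/d$ score scaling used above; a row-wise softmax stands in for the generic non-linearity $\sigma$, and the $\Theta(1/\sqrt{d})$ initialization of $Q$, $K$, $V$ follows the hidden-layer rule.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d = 8, 1024                          # finite sequence length s, width d

def softmax_rows(z):                    # row-wise softmax as the non-linearity sigma
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.standard_normal((s, d))                 # Theta(1) coordinates
Q = rng.standard_normal((d, d)) / np.sqrt(d)    # hidden-type init, variance 1/d
K = rng.standard_normal((d, d)) / np.sqrt(d)
V = rng.standard_normal((d, d)) / np.sqrt(d)

scores = (x @ Q) @ (x @ K).T / d        # note: 1/d scaling, not 1/sqrt(d)
O = softmax_rows(scores) @ (x @ V)      # O = sigma(x Q K^T x^T / d) x V
print(O.shape)                          # (s, d)
```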
Layernorm and Bias
The layernorm layer is expressed mathematically as follows:

$$\mathrm{ln}(x)=\frac{x-\mu}{\sigma_x}\odot G+B$$

where $G\in\mathbb{R}^{d\times 1}$ is the gain parameter used to re-scale the standardized summed inputs, and $B\in\mathbb{R}^{d\times 1}$ is the bias parameter, like any other bias term. The mean is $\mu=\frac{1}{d}\sum_{i=1}^{d}x_i$ and the standard deviation is $\sigma_x=\sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i-\mu)^2}$ (written $\sigma_x$ to distinguish it from the activation function). According to desideratum 1, the input $x$ has coordinates of magnitude $\Theta(1)$, independent of the width $d$; $\mu$ and $\sigma_x$ are then also of size $\Theta(1)$, and so is the standardized input $\bar{x}=\frac{x-\mu}{\sigma_x}$.
Calculate $\frac{\partial L}{\partial G}$:

$$\frac{\partial L}{\partial G}=\frac{\partial L}{\partial \mathrm{ln}}\odot\frac{\partial \mathrm{ln}}{\partial G}=\frac{\partial L}{\partial \mathrm{ln}}\odot\frac{x-\mu}{\sigma_x}=\frac{\partial L}{\partial \mathrm{ln}}\odot\bar{x}$$
Calculate $\frac{\partial L}{\partial B}$:

$$\frac{\partial L}{\partial B}=\frac{\partial L}{\partial \mathrm{ln}}\odot\frac{\partial \mathrm{ln}}{\partial B}=\frac{\partial L}{\partial \mathrm{ln}}\odot\mathbf{1}=\frac{\partial L}{\partial \mathrm{ln}}$$
We can view the bias term $B$ as the parameters of an input layer with constant input $1\in\mathbb{R}^{1}$, so all of our conclusions for the input layer apply to it. Let's focus on the $\frac{x-\mu}{\sigma_x}\odot G$ term.
According to the 3 desiderata, we want $\bar{x}\odot(G-\eta\Delta G)$ to have coordinate size $\Theta(1)$, which implies both $\bar{x}\odot G$ and $\eta\,\bar{x}\odot\Delta G$ have coordinate size $\Theta(1)$. So $G$ has to be initialized with coordinate size $\Theta(1)$. For the update term, $\eta\,\bar{x}\odot\Delta G=\eta\,\bar{x}\odot\frac{\partial L}{\partial \mathrm{ln}}\odot\bar{x}$. We know $\bar{x}\odot\bar{x}$ has order $\Theta(1)$. As shown previously, $\frac{\partial L}{\partial \mathrm{ln}}$ is of order $\Theta(1/d)$ because of the factor passed down from the readout layer. To counter it, for SGD we need to set $\eta=d$. For Adam, the learning rate is a constant, because the $\Theta(1/d)$ factor from $\frac{\partial L}{\partial \mathrm{ln}}$ in the gradient is canceled out by the normalization. We can see this is exactly the same case as the input layer.
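A short NumPy sketch (my own) of one layernorm gain update under these rules: $G$ is initialized at $\Theta(1)$, the upstream gradient is a random stand-in with $\Theta(1/d)$ coordinates, and the SGD learning rate is multiplied by $d$ (the input-layer rule); the base learning rate value is an arbitrary illustrative choice, and the sign of the gradient is used as a crude stand-in for Adam's normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
base_lr = 0.1                                   # arbitrary illustrative base learning rate
for d in [256, 1024, 4096, 16384]:
    x = rng.standard_normal((d, 1))             # Theta(1) input coordinates
    x_bar = (x - x.mean()) / x.std()            # standardized input
    G, B = np.ones((d, 1)), np.zeros((d, 1))    # gain and bias, Theta(1) initialization
    out = x_bar * G + B                         # layernorm forward

    dL_dln = rng.standard_normal((d, 1)) / d    # stand-in for the Theta(1/d) upstream gradient
    dG = dL_dln * x_bar                         # dL/dG = dL/dln (element-wise) x_bar
    upd_sgd = (base_lr * d) * dG                # SGD: learning rate scaled by d
    upd_adam = base_lr * np.sign(dG)            # crude stand-in for Adam's normalization
    print(f"d={d:6d}  |SGD update of G| ~ {np.abs(upd_sgd).mean():.3f}"
          f"  |Adam-like update of G| ~ {np.abs(upd_adam).mean():.3f}")
```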
Intuition about Lemma J.1
In the paper [1], the abc-parameterization is not unique: different parameterizations can be converted into one another according to Lemma J.1.
Lemma J.1. Let $f_t(x)$ denote the neural network function after $t$ steps of training (using any fixed sequence of batches), evaluated on input $x$. Consider a parameter tensor $W$ with learning rate $C$, initialized as $W\sim\mathcal{N}(0,B^2)$, and with a multiplier $A$. Then for any $\theta>0$, $f_t(x)$ stays fixed for all $t$ and $x$ if we set

$A\leftarrow A\theta,\quad B\leftarrow B/\theta,\quad C\leftarrow C/\theta^2$ when the optimizer is SGD;

$A\leftarrow A\theta,\quad B\leftarrow B/\theta,\quad C\leftarrow C/\theta$ when the optimizer is Adam.
To see why this is the case, let's write down the forward pass equation as

$$f_0(x)=\sigma(\alpha W_0x)$$

where $\alpha$ is the multiplier constant. Note, $\alpha$ is set to $A$, $W$ is initialized as a Gaussian $W_0\sim\mathcal{N}(0,B^2)$, and the learning rate $\eta$ is set to $C$. After one training step, it becomes:

$$f_1(x)=\sigma(\alpha W_1x)=\sigma(\alpha W_0x-\eta\alpha\Delta Wx) \tag{15}$$
Using the same MSE loss and following the same derivation as in Eq. (6), the weight update is

$$\Delta W=\alpha\left(2(f_0-y)\odot\sigma'(\alpha W_0x)\right)x^T \tag{16}$$
We can see that the $\alpha W_0x$ term is invariant under the $A\leftarrow A\theta$, $B\leftarrow B/\theta$ change, and that $\Delta W$ is increased by a factor of $\theta$ because of the extra $\alpha$ factor in Eq. (16). However, when using the Adam optimizer, this factor of $\theta$ is canceled out by the second-moment normalization, so $\Delta W$ is unchanged.
Substituting Eq. (16) into Eq. (15), we notice the first term $\alpha W_0x$ is invariant under the $A\leftarrow A\theta$, $B\leftarrow B/\theta$ change. The second term $\eta\alpha\Delta Wx$ is invariant under $C\leftarrow C/\theta^2$ for the SGD optimizer, because both $\alpha$ and $\Delta W$ carry a factor of $\theta$. For the Adam optimizer, $\Delta W$ carries a factor of $1$ due to the normalization, so we need $C\leftarrow C/\theta$ to make it invariant.
We have seen the case of a single layer. What about multiple layers? Let's calculate the gradient with respect to the lower layer's output (i.e. the current layer's input), as in Eq. (8):
$$\frac{\partial L}{\partial x}=\alpha\left(\frac{\partial L}{\partial f}\odot\sigma'^T(\alpha Wx)\right)W=2\alpha(f-y)^T\odot\sigma'^T(\alpha Wx)\,W \tag{17}$$
It is easy to see that $\alpha W$ is invariant under $A\leftarrow A\theta$, $B\leftarrow B/\theta$, so the gradient passed down in the backward step is invariant under the Lemma J.1 rescaling. The same invariance argument for a single layer then applies to multi-layer neural networks recursively, by replacing $\frac{\partial L}{\partial f}$ with $\frac{\partial L}{\partial x}$ in Eq. (16) and Eq. (17).
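As a numeric sanity check of Lemma J.1 (a sketch under the assumptions of the derivation above: a single layer, MSE loss, full-batch SGD, and the same underlying Gaussian draw for both runs), the rescaling $A\leftarrow A\theta$, $B\leftarrow B/\theta$, $C\leftarrow C/\theta^2$ leaves the trained function unchanged:

```python
import numpy as np

def train_sgd(A, B, C, steps=5, seed=0):
    """Single layer f(x) = sigma(A * W x), W ~ N(0, B^2), SGD with learning rate C, MSE loss."""
    rng = np.random.default_rng(seed)
    v, d = 4, 64
    W = B * rng.standard_normal((v, d))   # same underlying Gaussian draw for every (A, B, C)
    x = rng.standard_normal((d, 1))
    y = rng.standard_normal((v, 1))
    sigma = np.tanh
    dsigma = lambda z: 1 - np.tanh(z) ** 2
    for _ in range(steps):
        z = A * (W @ x)
        f = sigma(z)
        dW = A * (2 * (f - y) * dsigma(z)) @ x.T   # Eq. (16): gradient carries a factor of A
        W = W - C * dW
    return sigma(A * (W @ x))

theta = 3.0
A, B, C = 1.0, 0.5, 0.01
f1 = train_sgd(A, B, C)
f2 = train_sgd(A * theta, B / theta, C / theta ** 2)   # Lemma J.1 rescaling for SGD
print(np.max(np.abs(f1 - f2)))   # agrees up to floating-point rounding
```

For the Adam case, the corresponding check would rescale $C\leftarrow C/\theta$ instead, relying on the second-moment normalization to remove the extra factor of $\theta$ from $\Delta W$.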
References
[1] Yang, Greg, et al. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." arXiv preprint arXiv:2203.03466 (2022).
[2] Yang, Greg, and Edward J. Hu. "Feature Learning in Infinite-Width Neural Networks." arXiv preprint arXiv:2011.14522 (2020).