Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Yankun Han

TL;DR

The paper addresses how weight initialization affects signal propagation and gradient flow in deep networks and large language models. It develops a theory of forward and backward variance propagation for rectifiers (ReLU/GELU) and examines its implications for GPT-2–style transformers, complemented by a logarithmic sweep to identify a practical stability band, $\sigma \in [10^{-2}, 10^{-1}]$. Empirically, it shows that Kaiming (fan-in) initialization yields faster, more stable convergence than Xavier under ReLU, and that in a from-scratch 12-layer GPT-2–style model, layerwise weight variance exhibits depth-dependent equilibration: shallow layers adapt quickly while deeper layers evolve more gradually, converging to narrow variance bands. The practical takeaway is simple: use Kaiming initialization for rectifiers, initialize transformer projections with small std (about $0.02$), monitor per-layer variance and gradient norms, and apply residual scaling or warmup adjustments to maintain healthy variance flow across depth.
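The recommendation to use Kaiming initialization for rectifiers follows from the standard forward-variance argument, sketched here under textbook assumptions (zero-mean i.i.d. weights, zero biases, pre-activations symmetric about zero). The symbols $n_l$ (fan-in of layer $l$), $w_l$, $y_l$ (pre-activation), and $x_l = \mathrm{ReLU}(y_{l-1})$ are notation chosen for this sketch and may differ from the paper's own equations, which are not reproduced in this summary.

$$
\mathrm{Var}[y_l] = n_l \,\mathrm{Var}[w_l]\, \mathbb{E}[x_l^2],
\qquad
\mathbb{E}[x_l^2] = \tfrac{1}{2}\,\mathrm{Var}[y_{l-1}]
\;\;\Longrightarrow\;\;
\mathrm{Var}[y_l] = \tfrac{1}{2}\, n_l \,\mathrm{Var}[w_l]\, \mathrm{Var}[y_{l-1}]
$$

Setting the per-layer factor $\tfrac{1}{2} n_l \mathrm{Var}[w_l]$ to 1 gives the fan-in rule $\mathrm{Var}[w_l] = 2/n_l$. Xavier initialization uses $\mathrm{Var}[w_l] = 2/(n_{\text{in}} + n_{\text{out}})$, which for square layers equals $1/n_l$, so under ReLU the factor becomes $\tfrac{1}{2}$ and the forward variance roughly halves at every layer.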

Abstract

Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, the paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
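The Kaiming-versus-Xavier contrast described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's experimental code: the function name `final_preactivation_variance`, the depth, width, batch size, and seed are illustrative assumptions, and the network is a bias-free square ReLU MLP fed unit-variance Gaussian inputs. It only measures forward variance at initialization; it says nothing about training speed.

```python
import numpy as np

def final_preactivation_variance(init, depth=20, width=256, batch=1024, seed=0):
    """Push Gaussian inputs through a bias-free ReLU MLP at initialization
    and return the variance of the last layer's pre-activations."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        if init == "kaiming":
            std = np.sqrt(2.0 / width)            # fan-in rule: Var[w] = 2 / fan_in
        elif init == "xavier":
            std = np.sqrt(2.0 / (width + width))  # Glorot: Var[w] = 2 / (fan_in + fan_out)
        else:
            raise ValueError(f"unknown init: {init}")
        w = rng.standard_normal((width, width)) * std
        pre = x @ w                # pre-activations of this layer
        x = np.maximum(pre, 0.0)   # ReLU
    return float(pre.var())

kaiming_var = final_preactivation_variance("kaiming")
xavier_var = final_preactivation_variance("xavier")
print(f"Kaiming: {kaiming_var:.3e}  Xavier: {xavier_var:.3e}")
```

Because both runs share a seed, the Xavier weights are the Kaiming weights scaled by $1/\sqrt{2}$, and ReLU is positively homogeneous, so the two variances differ by a factor of $2^{\text{depth}}$: the Kaiming value stays of order one while the Xavier value shrinks geometrically with depth.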

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures.

Figures (6)

  • Figure 1: E1: Loss trajectories for selected initialization scales. Stable learning occurs for $\sigma \in [10^{-2}, 10^{-1}]$, while too small or too large $\sigma$ causes vanishing or unstable behavior.
  • Figure 2: E2: Loss, train accuracy, and test accuracy for Xavier (blue) vs. Kaiming (orange) under identical settings. Kaiming shows faster loss decay and higher accuracy on the training dataset.
  • Figure 3: E2 (t-test): Aggregated statistics over 10 runs. Paired t-tests show significant differences between Xavier and Kaiming in both loss and training accuracy ($p < 0.05$).
  • Figure 4: E3: Loss curves (train vs. test) during pre-training of a from-scratch GPT-2 model.
  • Figure 5: E3 (one sample layer's weight distribution): Weight-distribution dynamics sampled every 50 epochs. The distributions become increasingly sparse over time, with mass progressively concentrating near zero.
  • ...and 1 more figure
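The logarithmic sweep behind Figure 1 can be mimicked analytically with a small sketch. It uses the mean-field approximation that a bias-free ReLU layer with zero-mean weights of standard deviation $\sigma$ multiplies the forward variance by $\tfrac{1}{2} \cdot \text{fan-in} \cdot \sigma^2$. The helper names `relu_gain` and `sweep`, and the values of `FAN_IN` and `DEPTH`, are hypothetical and not taken from the paper, so the output is not expected to reproduce the reported band exactly.

```python
import math

def relu_gain(sigma, fan_in):
    """Per-layer forward-variance multiplier for a bias-free ReLU layer
    with zero-mean weights of std sigma (mean-field approximation)."""
    return 0.5 * fan_in * sigma ** 2

def sweep(fan_in, depth, exponents=(-4, -3, -2, -1, 0)):
    """Return (sigma, per-layer gain, gain compounded over depth) on a log grid."""
    rows = []
    for e in exponents:
        sigma = 10.0 ** e
        g = relu_gain(sigma, fan_in)
        rows.append((sigma, g, g ** depth))
    return rows

FAN_IN, DEPTH = 256, 4  # hypothetical sizes, not taken from the paper
for sigma, g, total in sweep(FAN_IN, DEPTH):
    print(f"sigma={sigma:g}  gain/layer={g:.3g}  gain over {DEPTH} layers={total:.3g}")

# The std at which the per-layer gain equals 1 (the Kaiming fan-in value).
print("critical sigma:", math.sqrt(2.0 / FAN_IN))
```

Under this approximation the critical standard deviation is $\sqrt{2/\text{fan-in}}$, which equals $10^{-1}$ at fan-in 200 and $10^{-2}$ at fan-in 20000, so the reported band lines up with criticality for layer widths in that range. Values far below it compound toward zero and values far above it compound toward overflow.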