On the Weight Dynamics of Deep Normalized Networks

Christian H. X. Ali Mehmeti-Göpel; Michael Wand

TL;DR

Problem: Disparities in effective learning rates (ELRs) across the layers of normalization-based networks hinder trainability. Approach: The paper builds discrete and continuous dynamical models of weight norms, gradient norms, and ELRs, predicting when layer-wise ELR ratios converge or flip under a constant learning rate. Contributions: (i) a general auto rate-tuning theory with the gradient-flow model $\frac{d\sigma^2}{dt}=\frac{c^2}{\sigma^2}$, which admits a closed-form solution, and the ELR limit $\lim_{t\to\infty} \frac{E_\ell}{E_k}=1$; (ii) identification of regime boundaries and a hyperparameter-free warm-up method; (iii) empirical validation on CNNs and Transformers and a constrained-ELR training technique. Findings: ELR spread is smallest at small or moderate learning rates and is further reduced by momentum and warm-up, which enables training of very deep networks; constrained-ELR training can stabilize training in otherwise unstable regimes. Impact: The results provide practical stabilization guidelines for deep normalization-based architectures and inform design choices that improve trainability in CNNs and Transformers.
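As a worked step, the gradient-flow equation quoted above can be integrated by separation of variables. The sketch below assumes that $c$ is constant in time and writes $\sigma_0^2$ for the initial value $\sigma^2(0)$; both are notational assumptions made here for illustration, not statements taken from the paper.

```latex
% Substitute u = \sigma^2 and separate variables (assuming c is constant in t).
\begin{align*}
\frac{du}{dt} &= \frac{c^2}{u}
  \;\Longrightarrow\; u\,du = c^2\,dt
  \;\Longrightarrow\; \tfrac{1}{2}u^2 = c^2 t + \tfrac{1}{2}u_0^2, \\
\sigma^2(t) &= \sqrt{\sigma_0^4 + 2c^2 t}.
\end{align*}
```

Under these assumptions the squared norm grows like $\sqrt{t}$, so the influence of the initial value $\sigma_0^2$ fades as $t$ grows.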

Abstract

Recent studies have shown that high disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability. We formalize how these disparities evolve over time by modeling weight dynamics (evolution of expected gradient and weight norms) of networks with normalization layers, predicting the evolution of layer-wise ELR ratios. We prove that when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion. We identify a "critical learning rate" beyond which ELR disparities widen, which only depends on current ELRs. To validate our findings, we devise a hyperparameter-free warm-up method that quickly minimizes ELR spread in theory and in practice. Our experiments link ELR spread with trainability, a relationship that is most evident in very deep networks with significant gradient magnitude excursions.
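The convergence claim can be illustrated numerically with a minimal sketch. It assumes the gradient-flow model $\frac{d\sigma^2}{dt}=\frac{c^2}{\sigma^2}$ from the TL;DR with a constant per-layer $c$, and it assumes that a layer's ELR is proportional to $c/\sigma^2$ (a relative-update-size definition chosen here for illustration; the paper's exact definition may differ). The layer values and the function names `sigma_sq`, `elr`, and `elr_ratio` are hypothetical.

```python
import math


def sigma_sq(t, sigma0_sq, c):
    """Solution of d(sigma^2)/dt = c^2 / sigma^2 with sigma^2(0) = sigma0_sq."""
    return math.sqrt(sigma0_sq ** 2 + 2.0 * c ** 2 * t)


def elr(t, sigma0_sq, c):
    # ASSUMPTION: ELR is taken proportional to c / sigma^2. The shared
    # learning-rate factor is omitted because it cancels in ratios.
    return c / sigma_sq(t, sigma0_sq, c)


def elr_ratio(t, layer_l, layer_k):
    """Ratio E_l / E_k at time t for two (sigma0_sq, c) layer tuples."""
    return elr(t, *layer_l) / elr(t, *layer_k)


# Hypothetical layers: (initial squared weight norm, gradient-scale constant c).
LAYER_L = (1.0, 5.0)
LAYER_K = (4.0, 0.5)

if __name__ == "__main__":
    for t in (0.0, 1.0, 1e2, 1e4, 1e8):
        print(f"t={t:g}  E_l/E_k={elr_ratio(t, LAYER_L, LAYER_K):.4f}")
```

In this toy setting the ratio starts far from 1 and approaches 1 as $t$ grows, because the factor $c$ in the numerator is offset by the asymptotic growth $\sigma^2 \approx \sqrt{2}\,c\,\sqrt{t}$ in the denominator.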

Paper Structure (30 sections, 5 theorems, 57 equations, 20 figures)

Introduction
Related Work and Contributions
General Auto Rate-Tuning Effect and Its Dynamics
Training Dynamics Induced by Auto Rate-Tuning
Auto Rate-Tuning Affects Each Layer Separately
Assumptions on Initial Gradient Norm
Asymptotic Behavior of Effective Learning Rate Ratios in the Gradient Flow
Asymptotic Behavior of Effective Learning Rate Ratios for bigger Step Sizes
Sufficient Conditions for Auto Rate-Tuning
Simulations and Experimental Validation
Model Validation
Short Term Validation
Long Term Validation
The Effect of Higher Learning Rates: Discrete Model vs Continuous Model
Relationship Between Effective Learning Rate and Learning Rate
...and 15 more sections

Key Result

Lemma 3.2: Stationary Point Is a Unique Attractor

Figures (20)

Figure 1: Simulated and real weight / gradient norm of the first layer in a Resnet56 NoShort at step 10 of training for different $\lambda$. "Drift" runs substitute real gradients by random vectors of similar norm.
Figure 2: Long-term evolution of effective learning rates for a network trained with random gradients, without (left) and with (right) affine BatchNorm parameters.
Figure 3: Long-term evolution of effective learning rates for a network trained with real gradients, without (left) and with (right) affine BatchNorm parameters.
Figure 4: Comparing the evolution of the layerwise simulated effective learning rate for different values of $\lambda$ in the discrete and the continuous model.
Figure 5: Simulated relative ELR spread after 10 steps in our discrete model for different values of $\lambda$.
...and 15 more figures

Theorems & Definitions (12)

Definition 3.1: Effective Learning Rate Ratios
Lemma 3.2: Stationary Point Is a Unique Attractor
proof
Definition 3.3: Flipping Ratio
Lemma 3.4: Flipping Conditions
proof
Lemma 3.5: Ratios Flip at Most Once
proof
Theorem 3.6: Convergence to Fixed Point
proof
...and 2 more
