Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment

Nathanaël Haas, François Gatine, Augustin M Cosse, Zied Bouraoui

TL;DR

The paper identifies depth-induced exponential scaling of ordered Jacobian singular values and spectral separation as key signatures governing the dynamics of Jacobian spectra in deep networks. By introducing Fixed-Gates Linear Networks and gated products, it proves the existence of Lyapunov exponents for the top singular values at initialization and shows how spectral separation enforces alignment of dominant singular directions across products, enabling an approximate deep-linear-like, mode-wise singular-value evolution without balancing. The authors provide a rigorous theoretical framework complemented by experiments demonstrating depth scaling and alignment in fixed-gates models trained on MNIST, suggesting a mechanistic basis for emergent low-rank Jacobian structure and implicit bias. Overall, depth scaling coupled with spectral separation offers a tractable path to understanding gradient-based training biases in deep architectures and informs potential strategies for analyzing generalization in practice.

Abstract

Understanding why gradient-based training in deep networks exhibits strong implicit bias remains challenging, in part because tractable singular-value dynamics are typically available only for balanced deep linear models. We propose an alternative route based on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of ordered singular values and strong spectral separation. Adopting a fixed-gates view of piecewise-linear networks, where Jacobians reduce to products of masked linear maps within a single activation region, we prove the existence of Lyapunov exponents governing the top singular values at initialization, give closed-form expressions in a tractable masked model, and quantify finite-depth corrections. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Together, these results motivate an approximation regime in which singular-value dynamics become effectively decoupled, mirroring classical balanced deep-linear analyses without requiring balancing. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics, supporting a mechanistic account of emergent low-rank Jacobian structure as a driver of implicit bias.
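The depth-scaling signature described above can be probed numerically. The following is a minimal sketch, not the authors' code: it assumes square Gaussian weights with variance 1/width, independent Bernoulli($p$) gates on the diagonal of each $D_\ell$, and a hypothetical helper name `depth_normalized_log_singular_values`. It forms the product $W_L D_{L-1}\cdots D_1 W_1$ with running renormalization and returns $\frac{1}{L}\log s_{k,L}$ for the top $k$ singular values.

```python
import numpy as np


def depth_normalized_log_singular_values(depth, width, p, k, seed=0):
    """Return (1/depth) * log of the top-k singular values of
    W_L D_{L-1} ... D_1 W_1 at random initialization.

    Assumptions (not taken from the paper): square layers, Gaussian
    weights with variance 1/width, Bernoulli(p) diagonal gates.
    """
    rng = np.random.default_rng(seed)
    product = np.eye(width)
    log_scale = 0.0  # log of the total factor removed by renormalization
    for layer in range(1, depth + 1):
        weight = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        product = weight @ product
        if layer < depth:
            # D_layer is a diagonal 0/1 mask applied to the layer output.
            gate = (rng.random(width) < p).astype(float)
            product = gate[:, None] * product
        # Renormalize so entries stay in floating-point range at large depth.
        scale = np.linalg.norm(product)
        product = product / scale
        log_scale += np.log(scale)
    top = np.linalg.svd(product, compute_uv=False)[:k]
    return (np.log(top) + log_scale) / depth


if __name__ == "__main__":
    for p in (1.0, 0.5):
        values = depth_normalized_log_singular_values(depth=30, width=32, p=p, k=3)
        print(f"p={p}: {values}")
```

If the exponential-scaling picture holds, these depth-normalized values should stabilize as `depth` grows, and the differences between consecutive entries give a rough measure of spectral separation per layer. The renormalization step is only a numerical convenience and does not change the singular values being reported.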

Paper Structure (36 sections, 20 theorems, 134 equations, 12 figures)

Key Result

Proposition 4.5

Consider a Masked Linear Network $J(t)=W_L(t) D_{L-1}\cdots D_1 W_1(t)$ trained by gradient flow on the weights $(W_\ell(t))_{\ell=1}^L$ with loss $\mathcal{L}(J)$. Let $D_0\coloneq I$ and $D_L\coloneq I$, and define $M_\ell(t)\coloneq D_\ell W_\ell(t) D_{\ell-1}$ for $\ell=1,\dots,L$. Assume the balancing condition […]. Under gradient flow, this balancing condition is preserved along training. Let $J(t)=\sum_k s_k(t)\,\dots$

Figures (12)

  • Figure 1: Evolution of the top-15 Jacobian log-singular values during training of a fixed-gates linear network (depth 10, width 64) on a synthetic rank-10 regression task. Inputs are Gaussian and targets are generated by a fixed random linear map of rank 10. Colors index order: brighter curves correspond to larger singular values (from $s_1$ to $s_{15}$).
  • Figure 2: Convergence of $\frac{1}{L}\log s_{1,L}$ to $\gamma_1$ and comparison to the first-order correction $\gamma_1 + \frac{d_0-d_1}{L}$, for Gaussian weights and Bernoulli gates with $p=1$ (left) and $p=0.5$ (right).
  • Figure 3: Top 64 values of $\frac{1}{L}\log s_{k,L}$ compared to $\gamma_k$ and to the corrected prediction $\gamma_k + \frac{d_{k-1}-d_k}{L}$, for depth $L=20$ with Gaussian weights and Bernoulli gates with $p=1$ (left) and $p=0.5$ (right).
  • Figure 4: Spectrum of $\log s_{k,\ell}$ for the top 30 log-singular values of the intermediate Jacobians $B_{\ell}$ as $\ell$ varies, at initialization (left, $L=100$) and trained in the MNIST setting (right), $p=0.5$.
  • Figure 5: Diagonal correlation coefficient of $U_{J_L}^{\top}U_{A_{\ell}}$ at initialization (blue) and trained in the MNIST setting (orange) with $p=0.5$.
  • ...and 7 more figures
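
The alignment diagnostic in Figure 5 compares left singular vectors of the full Jacobian with those of a partial product. The sketch below is a toy illustration under stated assumptions, not the paper's experiment: it assumes $J = A B$, where $A$ plays the role of $A_\ell$ and is given a geometric spectrum $s_k = e^{-\mathrm{gap}\cdot k}$, and $B$ is a generic Gaussian matrix. The function names and the `gap` parameter are hypothetical. It returns the absolute diagonal entries of $U_J^{\top} U_A$ for the top $k$ modes.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q


def top_left_alignment(gap, width=16, k=3, seed=0):
    """Absolute diagonal of U_J^T U_A for the top-k modes of J = A @ B.

    A has singular values exp(-gap * i), so `gap` sets the log-separation
    between consecutive modes (an illustrative choice, not the paper's
    setup). B is a generic Gaussian matrix.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.exp(-gap * np.arange(width))
    left = random_orthogonal(width, rng)
    right = random_orthogonal(width, rng)
    A = left @ np.diag(spectrum) @ right.T
    B = rng.normal(size=(width, width)) / np.sqrt(width)
    J = A @ B
    U_J = np.linalg.svd(J)[0]
    U_A = np.linalg.svd(A)[0]
    # Column-wise inner products; the absolute value removes the sign
    # ambiguity of singular vectors.
    return np.abs(np.sum(U_J[:, :k] * U_A[:, :k], axis=0))


if __name__ == "__main__":
    for gap in (0.1, 1.0, 6.0):
        print(f"gap={gap}: {top_left_alignment(gap)}")
```

Values close to 1 indicate that the dominant left singular directions of the product coincide with those of the left factor. In this toy setting, alignment should improve as `gap` increases, which is the qualitative behavior the separation results describe.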

Theorems & Definitions (55)

  • Definition 4.1: Fixed-Gates Linear Network
  • Remark 4.2
  • Definition 4.3: Multi-mode FGLN
  • Remark 4.4
  • Proposition 4.5: Balanced singular-value dynamics for masked linear networks
  • Definition 4.6: Depth scaling
  • Definition 4.7: Spectral separation (working definition)
  • Definition 5.1
  • Definition 5.2
  • Theorem 5.3
  • ...and 45 more