Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability

Bum Jun Kim, Shohei Taniguchi, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo

Abstract

Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an expected probability of failure for a transformer before training starts. Our study of Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across extensive configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We empirically validate that our method works in practice: RKSP predicts divergence at initialization, and when RKSP flags high risk, turning on KSS successfully prevents divergence. In the challenging high learning rate regime without normalization layers, KSS reduces the divergence rate from 66.7% to 12.5% and enables learning rates that are 50% to 150% higher. These findings generalize to WikiText-103 language modeling, vision transformers on CIFAR-10, and pretrained language models, including GPT-2 and LLaMA-2 up to 7B, as well as emerging architectures such as MoE, Mamba-style SSMs, and KAN.
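
The pipeline described in the abstract (whitened dynamic mode decomposition on residual snapshots, followed by the near-unit spectral mass) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `whitened_dmd_eigs` and `near_unit_mass`, the snapshot layout (rows are samples, with `X` holding residual states at one layer and `Y` the states at the next), the ridge term `eps`, and the band widths `eps_n` and `eps_u` are all assumptions made for illustration.

```python
import numpy as np

def whitened_dmd_eigs(X, Y, eps=1e-6):
    """Eigenvalues of a linear map fitted between snapshot pairs.

    X, Y: arrays of shape (n_samples, d). Assumed layout: row i of Y is
    the residual state that follows row i of X (hypothetical convention).
    """
    n, d = X.shape
    # Ridge-regularized second-moment matrix of the inputs.
    sigma = X.T @ X / n + eps * np.eye(d)
    # Symmetric inverse square root, used as the whitening transform.
    w, V = np.linalg.eigh(sigma)
    W = V @ np.diag(w ** -0.5) @ V.T
    Xw, Yw = X @ W, Y @ W
    # Least-squares fit Yw ~ Xw @ K in whitened coordinates.
    K, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)
    return np.linalg.eigvals(K)

def near_unit_mass(eigs, eps_n=0.05, eps_u=0.05):
    """Fraction of eigenvalues with modulus in [1 - eps_n, 1 + eps_u]."""
    mags = np.abs(eigs)
    in_band = (mags >= 1.0 - eps_n) & (mags <= 1.0 + eps_u)
    return float(np.mean(in_band))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.diag([1.0, 0.98, 0.5, 0.2])   # toy layer map with known spectrum
    X = rng.standard_normal((200, 4))
    Y = X @ A.T
    eigs = whitened_dmd_eigs(X, Y)
    print(np.sort(np.abs(eigs)), near_unit_mass(eigs))
```

In this toy case the fitted operator is similar to the transpose of `A`, so its eigenvalues match those of `A` and two of the four moduli fall inside the band. On real activations the fit is only approximate, which is what the normalized linear-fit error in the figures is meant to quantify.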

Paper Structure (68 sections, 5 theorems, 42 equations, 7 figures, 17 tables, 1 algorithm)

Key Result

Theorem 1

Let $\mathbf{A} \in \mathbb{C}^{d \times d}$ be normal with eigenvalues $\{\lambda_j\}_{j=1}^d$. For a unit vector $\mathbf{x}$ drawn uniformly on the sphere, [...]. If $\rho(\mathbf{A})\le 1+\epsilon_u$ and $M_{\approx 1}(\mathbf{A})$ denotes the fraction of eigenvalues with $|\lambda_j|\in [1-\epsilon_n,1+\epsilon_u]$, then [...]. Hence larger $M_{\approx 1}$ implies more energy-preserving and less damped p...
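
The displayed equations of the theorem did not survive in the statement above. As a hedged illustration of why a conclusion of this kind is plausible, the following derivation uses only the visible hypotheses plus the assumption $0 \le \epsilon_n \le 1$. It is a sketch of a standard argument and is not claimed to be the paper's exact inequality. Since $\mathbb{E}[\mathbf{x}\mathbf{x}^*] = \mathbf{I}/d$ for a uniformly distributed unit vector and $\operatorname{tr}(\mathbf{A}^*\mathbf{A}) = \sum_j |\lambda_j|^2$ for normal $\mathbf{A}$,

$$
\mathbb{E}\,\|\mathbf{A}\mathbf{x}\|^2 = \frac{1}{d}\operatorname{tr}(\mathbf{A}^*\mathbf{A}) = \frac{1}{d}\sum_{j=1}^{d} |\lambda_j|^2 .
$$

Under $\rho(\mathbf{A}) \le 1+\epsilon_u$, every eigenvalue outside the band satisfies $|\lambda_j| < 1-\epsilon_n$, so splitting the sum into in-band and out-of-band terms gives

$$
(1-\epsilon_n)^2\, M_{\approx 1}(\mathbf{A}) \;\le\; \mathbb{E}\,\|\mathbf{A}\mathbf{x}\|^2 \;\le\; (1+\epsilon_u)^2\, M_{\approx 1}(\mathbf{A}) + (1-\epsilon_n)^2 \bigl(1 - M_{\approx 1}(\mathbf{A})\bigr).
$$

Both bounds increase with $M_{\approx 1}(\mathbf{A})$, which is consistent with the closing remark that a larger near-unit mass corresponds to less damping.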

Figures (7)

  • Figure 1: Scatter plot of DMD eigenvalues across layers in a pre-layer normalization transformer. The color gradient indicates layer depth; blue is early and red is late. Early layers cluster near the unit circle; late layers exhibit an increased spectral radius.
  • Figure 2: KSS regularization effectiveness. (Left) The divergence rate decreases with KSS weight $\alpha$. (Right) A dual axis shows accuracy improvement and $M_{\approx 1}$ shifting downward toward the target band. KSS shapes spectral properties, improving both stability and performance. Measured on the associative-recall task.
  • Figure 3: Scaling law for spectral properties. (Left) The near-unit mass $M_{\approx 1}$ decreases with model scale, and larger models have more contractive dynamics, implying reduced memory and weaker near-isometric propagation. (Right) The normalized linear-fit error $\eta_{\mathrm{nl}}$ increases with scale, indicating a less reliable linear approximation at scale. Log-linear fits are shown. Computed from residual-stream activations on a fixed set of short prompt sentences.
  • Figure 4: Start Linear, End Nonlinear pattern. Layer-wise normalized linear-fit error $\eta_{\mathrm{nl}}$ across four pretrained models. All models exhibit a monotonically increasing $\eta_{\mathrm{nl}}$ with depth, suggesting a consistent linear-approximation signature across models. Computed from residual-stream activations on a fixed set of short prompt sentences.
  • Figure 5: Calibration reliability diagram. (Left) Predicted divergence probability versus observed frequency, with an ECE of 0.283. (Right) Distribution of predictions separated by actual outcome. RKSP provides moderately calibrated probability estimates. Calibration is evaluated on associative-recall runs, comparing predicted probabilities with the divergence outcomes observed on that task.
  • ...and 2 more figures
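
The ECE value quoted for Figure 5 is an expected calibration error. A common way to compute this quantity is sketched below, assuming equal-width probability bins. The function name `expected_calibration_error` and the bin count are illustrative choices, and the binning scheme actually used in the paper is not stated here.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |confidence - frequency| per bin.

    probs: predicted divergence probabilities in [0, 1].
    outcomes: 1 if the run diverged, 0 otherwise.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)

if __name__ == "__main__":
    # Two confident predictions that were right, two that were also right:
    # the only miscalibration is the 0.15 gap between confidence and frequency.
    print(expected_calibration_error([0.85, 0.85, 0.15, 0.15], [1, 1, 0, 0]))
```

A value of 0 means predicted probabilities match observed frequencies in every bin, and larger values indicate a larger average gap.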

Theorems & Definitions (15)

  • Definition 1: Spectral Mass Partition
  • Definition 2: KSS Regularization Loss
  • Theorem 1: Near-Unit Energy Preservation under Near-Normality
  • Corollary 2: Depth-wise Damping and Gradient Flow
  • Proposition 3: Bauer–Fike: Non-normality Caveat
  • Proof
  • Theorem 4: Whitened DMD Finite-Sample Convergence
  • Proof
  • Remark 1: On $\mathbf{G}_\epsilon^{-1}$ for $\boldsymbol{\Sigma}_\epsilon = \boldsymbol{\Sigma} + \epsilon\mathbf{I}$
  • Remark 2: Sample Complexity
  • ...and 5 more