On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

Thomas T. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni

TL;DR

This work investigates why layer-wise Kronecker-factored preconditioning (KFAC/Shampoo) can outperform diagonal optimizers like Adam in neural network optimization, particularly for feature learning under anisotropic covariates. By analyzing two canonical models, linear representation learning and single-index learning, the authors show that SGD exhibits provable inefficiencies when inputs deviate from isotropy, and that a principled KFAC-style preconditioner yields condition-number-free convergence and improved feature learning. They derive stylized KFAC updates with concrete contraction guarantees, demonstrate that full second-order methods underperform these layer-wise preconditioners, and provide numerical validation across transfer learning and anisotropy settings. The results connect optimization geometry to provable feature learning, with implications for practical training and generalization in deep networks.

Abstract

Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.

Paper Structure

This paper contains 53 sections, 31 theorems, 170 equations, 6 figures.

Key Result

Lemma 3.1

Given $\mathbf{G}$, define the least-squares estimator $\widehat{\mathbf{F}}_{\mathrm{ls}}$. Then, for any $\eta_{\mathbf{F}} \in (0, 1]$, the $\mathbf{F}$-update in eq:KFAC_update can be rewritten as an exponential moving average (EMA) of $\widehat{\mathbf{F}}_{\mathrm{ls}}$.
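The algebra behind this kind of statement can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this listing: a two-layer linear model $\mathbf{y} \approx \mathbf{F}\mathbf{G}\mathbf{x}$ with squared loss, and an $\mathbf{F}$-update that right-multiplies the gradient by the inverse empirical second moment of $\mathbf{z} = \mathbf{G}\mathbf{x}$. The exact form of eq:KFAC_update is not reproduced here, and the function names `least_squares_F` and `preconditioned_F_step` are hypothetical. Under these assumptions, one preconditioned step with step size $\eta_{\mathbf{F}}$ should coincide with $(1 - \eta_{\mathbf{F}})\mathbf{F} + \eta_{\mathbf{F}}\widehat{\mathbf{F}}_{\mathrm{ls}}$ up to floating-point error.

```python
import numpy as np


def least_squares_F(X, Y, G):
    """Least-squares F with G held fixed: argmin_F sum_i ||y_i - F G x_i||^2."""
    Z = X @ G.T                                  # features z_i = G x_i, shape (n, k)
    Ft, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # solves Z @ Ft ~ Y, Ft is (k, d_y)
    return Ft.T                                  # shape (d_y, k)


def preconditioned_F_step(F, X, Y, G, eta_F):
    """One gradient step on F, right-preconditioned by the second moment of z.

    Assumed form (not taken from the source): F - eta_F * grad @ Q^{-1}.
    """
    n = X.shape[0]
    Z = X @ G.T
    grad = (Z @ F.T - Y).T @ Z / n               # grad of (1/2n) sum ||y_i - F z_i||^2
    Q = Z.T @ Z / n                              # empirical second moment of z, (k, k)
    # Q is symmetric, so (Q^{-1} grad^T)^T equals grad @ Q^{-1}.
    return F - eta_F * np.linalg.solve(Q, grad.T).T


# Synthetic data with anisotropic inputs (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
n, d_x, k, d_y = 200, 8, 3, 4
X = rng.normal(size=(n, d_x)) * np.linspace(0.5, 3.0, d_x)
G = rng.normal(size=(k, d_x))
F_true = rng.normal(size=(d_y, k))
Y = X @ G.T @ F_true.T + 0.1 * rng.normal(size=(n, d_y))

F = rng.normal(size=(d_y, k))
eta_F = 0.3
F_step = preconditioned_F_step(F, X, Y, G, eta_F)
F_ema = (1 - eta_F) * F + eta_F * least_squares_F(X, Y, G)
print(np.allclose(F_step, F_ema))
```

With $\eta_{\mathbf{F}} = 1$ the same step lands exactly on the least-squares solution, which is one way to read why the preconditioned update is insensitive to the conditioning of the features in this toy setting.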

Figures (6)

  • Figure 1: From left to right: the training loss, subspace distance, and transfer loss induced by various algorithms on a linear representation learning task. Several algorithms converge in training loss but make negligible progress in subspace distance, and hence in transfer loss.
  • Figure 2: Subspace distance and the training loss of KFAC and AMGD (with and without batch-norm). Notably, batch-norm enables AMGD's train loss to converge, but not its subspace distance.
  • Figure 3: The correlation of the directions learned by SGD and KFAC with the true direction, from numerical simulations averaged over 30 trials, alongside theoretical predictions. (Left) For different values of $\lambda_{\mathbf{G}}$, the theoretical predictions closely match the simulations. (Right) The alignment of the feature learned by SGD deteriorates as anisotropy increases (larger ${\varepsilon}$), whereas the KFAC update remains accurate.
  • Figure 4: The effect of batch normalization (on AMGD) vs. KFAC in our experimental settings. (Left) Uniform with low anisotropy. (Middle) Gaussian with low anisotropy. (Right) Gaussian with high anisotropy.
  • Figure 5: The subspace distance between the true representation and the representations learned by different algorithms after $1000$ iterations, as a function of learning rate.
  • ...and 1 more figure

Theorems & Definitions (48)

  • Lemma 3.1
  • Definition 3.2: Subspace Distance [stewart1990matrix]
  • Remark 3.4: Multi-task Learning
  • Proposition 3.4
  • Theorem 3.5
  • Lemma 3.5
  • Theorem 3.7
  • Lemma 3.7
  • Corollary 3.8
  • Lemma 3.8
  • ...and 38 more