How Does Preconditioning Guide Feature Learning in Deep Neural Networks?

Kotaro Yoshida, Atsushi Nitanda

TL;DR

This work reframes preconditioning as a mechanism steering feature learning through a spectrum-biased geometry, showing that all input information relevant to learning is captured by the Gram matrix $G_P^{(t)}$ induced by the preconditioner. By instantiating $P$ as the $p$-th power of the input covariance, $P=\boldsymbol{\Sigma}_X^p$, the resulting Gram decomposition reveals that larger $p$ emphasizes high-variance directions while smaller $p$ emphasizes low-variance directions, and generalization in a single-index teacher setting depends on $p$ and the alignment between the teacher and input spectrum (as well as label noise). Empirical results across robustness to noise, OOD generalization under correlation shift, and forward knowledge transfer demonstrate that matching the spectral bias to the teacher improves performance, while misalignment can hurt generalization; notably, $p=-1$ often yields broad transfer. The findings offer a spectrum-centric view that informs optimizer design and motivates task-aware preconditioning strategies to enhance generalization and transfer in deep networks.

Abstract

Preconditioning is widely used in machine learning to accelerate convergence on the empirical risk, yet its role on the expected risk remains underexplored. In this work, we investigate how preconditioning affects feature learning and generalization performance. We first show that the input information available to the model is conveyed solely through the Gram matrix defined by the preconditioner's metric, thereby inducing a controllable spectral bias on feature learning. Concretely, instantiating the preconditioner as the $p$-th power of the input covariance matrix and within a single-index teacher model, we prove that in generalization, the exponent $p$ and the alignment between the teacher and the input spectrum are crucial factors. We further investigate how the interplay between these factors influences feature learning from three complementary perspectives: (i) Robustness to noise, (ii) Out-of-distribution generalization, and (iii) Forward knowledge transfer. Our results indicate that the learned feature representations closely mirror the spectral bias introduced by the preconditioner -- favoring components that are emphasized and exhibiting reduced sensitivity to those that are suppressed. Crucially, we demonstrate that generalization is significantly enhanced when this spectral bias is aligned with that of the teacher.
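The spectral bias described above can be made concrete with a small numerical sketch. It assumes that the Gram matrix takes the form $X^\top P X$ for a data matrix $X$ of shape $(d, n)$, and that the preconditioner is the $p$-th power of the empirical input covariance; the exact definitions in the paper may differ, and the function name `covariance_power` and the synthetic data are illustrative only. Under these assumptions the nonzero Gram eigenvalues equal $n\lambda_i^{p+1}$, where $\lambda_i$ are the covariance eigenvalues.

```python
import numpy as np

def covariance_power(X, p, eps=1e-12):
    """Return (Sigma^p, eigenvalues of Sigma) for Sigma = X X^T / n, X of shape (d, n)."""
    n = X.shape[1]
    sigma = X @ X.T / n
    evals, evecs = np.linalg.eigh(sigma)       # eigenvalues in ascending order
    evals = np.clip(evals, eps, None)          # guard against negative powers of ~0
    return (evecs * evals**p) @ evecs.T, evals

# Synthetic inputs with an anisotropic spectrum (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
d, n = 5, 200
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = scales[:, None] * rng.standard_normal((d, n))

gram_spectra = {}
for p in (-1.0, 0.0, 1.0):
    P, evals = covariance_power(X, p)
    G = X.T @ P @ X                            # assumed form of the Gram matrix
    nonzero = np.sort(np.linalg.eigvalsh(G))[-d:]   # G has rank d; keep the top d
    gram_spectra[p] = nonzero
    # Nonzero Gram eigenvalues should match n * lambda^(p + 1).
    assert np.allclose(nonzero, np.sort(n * evals**(p + 1)))

for p, spec in gram_spectra.items():
    print(f"p={p:+.0f}  largest/smallest Gram eigenvalue = {spec[-1] / spec[0]:.3g}")
```

In this sketch, $p=-1$ flattens the Gram spectrum completely (every nonzero eigenvalue equals $n$), while larger $p$ widens the gap between high-variance and low-variance directions.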

Paper Structure

This paper contains 28 sections, 6 theorems, 34 equations, 7 figures, 2 tables.

Key Result

Theorem 1

Assume that the initial preconditioner $\boldsymbol{P}^{(0)}$ is an arbitrary positive semi-definite matrix initialized independently of $\boldsymbol{W}_1^{(0)}$, and that the first layer is initialized in a $\boldsymbol{P}^{(0)}$-isotropic manner. Then, under Assumptions 1.P and 1.Q, for all $t\ge \dots$

Figures (7)

  • Figure 1: The relationship between robustness to noise and preconditioning. (a) and (b) show the results when preconditioning is performed using the exact covariance matrix, for Case High and Low, respectively. The top row presents the final test MSE for each SNR value, while the bottom row shows the train/test MSE trajectories for SNR=1. In Case High, larger $p$ values more effectively prevent overfitting, whereas in Case Low, the opposite trend is observed. (c) and (d) display the results for preconditioning with AdaHessian. Except for the numerically unstable case of $p = -2$, the same trends as in (a) and (b) are observed here as well.
  • Figure 2: The relationship between OOD generalization and preconditioning. (a) Comparison across optimizers (SAM, GD, Adam, Sophia-H, AdaHessian, K-FAC, L-BFGS). (b) AdaHessian with different powers applied to the diagonal Hessian entries. In each panel, the left and right columns correspond to two settings (left: invariant digit with spurious noise; right: invariant noise with a spurious digit). Although ID Val accuracy (gray numbers) is near ceiling for all methods, OOD accuracy varies substantially, and the optimizer ranking reverses between the two settings. Sweeping the power in (b) reproduces the same reversal, indicating that preconditioning steers learning toward different covariance eigen-directions; OOD performance improves when this implicit emphasis aligns with invariant features.
  • Figure 3: The relationship between knowledge transferability and preconditioning. (a) and (b) show the case where exact covariance preconditioning with different exponents $p$ is applied in Task1, and (c) and (d) show the case where $p$ is swept using AdaHessian. For each case (High, Low), the test MSE of Task1/Task2 is shown. In Task1, generalization improves as $p$ increases in Case High and as $p$ decreases in Case Low. In contrast, in Task2, where only the second layer is optimized while the other components are fixed to the model trained in Task1, $p = -1$ gives the best result in both cases. Similar trends are reproduced in (c) and (d), indicating that the spectral bias formed in Task1 strongly influences knowledge transferability in Task2.
  • Figure 4: Training and test performance trajectories for each SNR level with the exact covariance preconditioner for different $p$ on Case High.
  • Figure 5: Training and test performance trajectories for each SNR level with the exact covariance preconditioner for different $p$ on Case Low.
  • ...and 2 more figures
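
The captions above describe preconditioning as steering learning toward different covariance eigen-directions. A minimal sketch of that mechanism follows, assuming a noiseless linear teacher and plain least squares rather than the paper's networks or single-index teacher; the update `w <- w - lr * P @ grad`, the step-size rule, and all variable names are illustrative assumptions. In this simplified setting the error along the $i$-th eigen-direction shrinks by a factor $1-\eta\lambda_i^{p+1}$ per step, so the exponent $p$ decides which directions are learned first.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, steps = 4, 500, 20
scales = np.array([2.0, 1.0, 0.5, 0.25])
X = scales[:, None] * rng.standard_normal((d, n))   # inputs, shape (d, n)
w_star = np.ones(d)
y = X.T @ w_star                                    # noiseless linear teacher (assumption)

sigma = X @ X.T / n
evals, evecs = np.linalg.eigh(sigma)                # ascending: index 0 = lowest variance

def remaining_error(p):
    """Fraction of the initial error left along each eigen-direction after `steps` updates."""
    P = (evecs * evals**p) @ evecs.T                # P = Sigma^p
    lr = 0.5 / np.max(evals**(p + 1))               # fastest direction contracts by 0.5 per step
    w = np.zeros(d)
    for _ in range(steps):
        grad = sigma @ w - X @ y / n                # gradient of (1/2n) * ||X^T w - y||^2
        w = w - lr * (P @ grad)                     # preconditioned gradient step
    err = evecs.T @ (w - w_star)                    # final error in the eigenbasis
    err0 = evecs.T @ (-w_star)                      # initial error in the eigenbasis (w = 0)
    return np.abs(err / err0)

errors = {p: remaining_error(p) for p in (-1.0, 0.0, 1.0)}
for p, e in errors.items():
    print(f"p={p:+.0f}  remaining error (low -> high variance): {np.round(e, 4)}")
```

With $p=-1$ every direction is learned at the same rate, whereas with $p=1$ the highest-variance direction converges quickly and the lowest-variance direction is left almost untouched. This only illustrates the direction-dependent learning speed; it does not reproduce the generalization results reported in the figures.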

Theorems & Definitions (7)

  • Theorem 1: Extension of Theorem 2.1.1 of Wadia et al. (2021)
  • Theorem 2: Extension of Theorem 2.2.1 of Wadia et al. (2021)
  • Proposition 1
  • proof
  • Lemma 1
  • Lemma 2
  • Lemma 3