How Does Preconditioning Guide Feature Learning in Deep Neural Networks?
Kotaro Yoshida, Atsushi Nitanda
TL;DR
This work reframes preconditioning as a mechanism that steers feature learning through a spectrum-biased geometry, and shows that all input information relevant to learning is captured by the Gram matrix $G_P^{(t)}$ induced by the preconditioner. When $P$ is instantiated as the $p$-th power of the input covariance, $P=\boldsymbol{\Sigma}_X^p$, the resulting Gram decomposition shows that larger $p$ emphasizes high-variance directions while smaller $p$ emphasizes low-variance directions. In a single-index teacher setting, generalization depends on $p$, on the alignment between the teacher and the input spectrum, and on label noise. Experiments on robustness to noise, OOD generalization under correlation shift, and forward knowledge transfer show that matching the spectral bias to the teacher improves performance, while misalignment can hurt generalization; notably, $p=-1$ often yields broad transfer. The findings offer a spectrum-centric view that informs optimizer design and motivates task-aware preconditioning strategies for improving generalization and transfer in deep networks.
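To make the spectral bias concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes centered inputs stored as rows of `X`, the empirical covariance $\boldsymbol{\Sigma}_X = X^\top X / n$, and a Gram matrix of the form $X P X^\top$; the helper names `covariance_power` and `preconditioned_gram` and the toy data are hypothetical. Under these assumptions, $X \boldsymbol{\Sigma}_X^p X^\top = \sum_k \lambda_k^p (X v_k)(X v_k)^\top$, so each eigen-direction $v_k$ is reweighted by $\lambda_k^p$.

```python
import numpy as np

def covariance_power(X, p, eps=1e-12):
    """Return (Sigma_X^p, eigenvalues, eigenvectors) for centered inputs X of shape (n, d).

    Assumption: Sigma_X is the empirical covariance X^T X / n.
    """
    n = X.shape[0]
    sigma = X.T @ X / n
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    eigvals = np.clip(eigvals, eps, None)      # guard against negative powers of ~0
    P = (eigvecs * eigvals**p) @ eigvecs.T     # V diag(lambda^p) V^T
    return P, eigvals, eigvecs

def preconditioned_gram(X, p):
    """Gram matrix X P X^T under the assumed form P = Sigma_X^p."""
    P, _, _ = covariance_power(X, p)
    return X @ P @ X.T

# Toy inputs with one high-, one medium-, and one low-variance direction.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.3])
X -= X.mean(axis=0)

for p in (-1.0, 0.0, 1.0):
    _, eigvals, _ = covariance_power(X, p)
    weights = eigvals**p / np.sum(eigvals**p)  # relative emphasis per direction
    print(f"p={p:+.0f}: weights (low -> high variance) = {np.round(weights, 3)}")
```

In this toy setup, $p=1$ puts most of the weight on the highest-variance direction, $p=0$ weights all directions equally, and $p=-1$ puts most of the weight on the lowest-variance direction.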
Abstract
Preconditioning is widely used in machine learning to accelerate convergence on the empirical risk, yet its effect on the expected risk remains underexplored. In this work, we investigate how preconditioning affects feature learning and generalization performance. We first show that the input information available to the model is conveyed solely through the Gram matrix defined by the preconditioner's metric, which induces a controllable spectral bias on feature learning. Concretely, taking the preconditioner to be the $p$-th power of the input covariance matrix and working within a single-index teacher model, we prove that generalization depends crucially on the exponent $p$ and on the alignment between the teacher and the input spectrum. We further investigate how the interplay between these factors influences feature learning from three complementary perspectives: (i) robustness to noise, (ii) out-of-distribution generalization, and (iii) forward knowledge transfer. Our results indicate that the learned feature representations closely mirror the spectral bias introduced by the preconditioner: they favor components that are emphasized and are less sensitive to those that are suppressed. Crucially, we demonstrate that generalization is significantly enhanced when this spectral bias is aligned with that of the teacher.
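The claim that inputs enter learning only through a preconditioned Gram matrix can be checked in a deliberately simplified setting. The sketch below is an illustration under stated assumptions, not the paper's setup: it uses a linear model with squared loss in place of a deep network, assumes $P=\boldsymbol{\Sigma}_X^p$ with the empirical covariance, and uses a hypothetical `tanh` link for the single-index teacher. In this case one preconditioned gradient step changes the training predictions by exactly $-(\eta/n)\, X P X^\top r$, where $r$ is the residual vector.

```python
import numpy as np

def sigma_power(X, p, eps=1e-12):
    """Sigma_X^p via eigendecomposition of the empirical covariance X^T X / n."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X / X.shape[0])
    return (eigvecs * np.clip(eigvals, eps, None) ** p) @ eigvecs.T

def preconditioned_step(w, X, y, P, lr):
    """One preconditioned gradient step on the loss (1 / 2n) * ||X w - y||^2."""
    residual = X @ w - y
    grad = X.T @ residual / X.shape[0]
    return w - lr * P @ grad

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d)) * np.array([2.0, 1.0, 0.5, 0.25])
X -= X.mean(axis=0)

teacher = rng.standard_normal(d)   # hypothetical teacher direction
y = np.tanh(X @ teacher)           # single-index labels; tanh link is an assumption

w = np.zeros(d)
p, lr = -1.0, 0.1
P = sigma_power(X, p)
w_new = preconditioned_step(w, X, y, P, lr)

# The change in predictions depends on X only through the Gram matrix X P X^T.
gram = X @ P @ X.T
residual = X @ w - y
pred_change_direct = X @ w_new - X @ w
pred_change_gram = -(lr / n) * gram @ residual
print(np.allclose(pred_change_direct, pred_change_gram))
```

Changing `p` here changes only `gram`, which is how the exponent controls which spectral components of the residual are fitted fastest in this simplified model.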
