On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width
Satoki Ishikawa, Ryo Karakida
TL;DR
This paper addresses how to scale second-order optimization to very wide neural networks by deriving a μP-inspired abc-parameterization for K-FAC and Shampoo. By analyzing a one-step update in the infinite-width limit, it determines how the random initialization, learning rates, and damping terms should scale with width so that feature learning remains stable as width grows. The key contributions are the μP formulations for K-FAC and Shampoo, adjustments to damping heuristics for stability, and empirical evidence that learning rates and damping values transfer across widths and improve performance on wide models. The resulting parameterization offers a principled way to apply second-order methods to larger models with fewer hyperparameter searches.
Abstract
Second-order optimization has been developed to accelerate the training of deep neural networks, and it is being applied to increasingly large models. In this study, with a view to training at even larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even as the network width increases significantly. Inspired by the maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of the hyperparameters, including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer hyperparameters across models with different widths.
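
To make the quantities in the abstract concrete, the following is a minimal sketch of one damped K-FAC step for a single fully connected layer, using the standard form ΔW = -η (B + ρ_B I)^{-1} G (A + ρ_A I)^{-1}. It is an illustration under stated assumptions, not the authors' implementation: the function names (kfac_step, width_scaled) are hypothetical, and the width exponents are left as free inputs because the summary above does not state the values derived in the paper. It is written in plain Python to stay dependency-free.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def add_damping(S, rho):
    """Return S + rho * I for a square matrix S."""
    n = len(S)
    return [[S[i][j] + (rho if i == j else 0.0) for j in range(n)] for i in range(n)]


def solve(S, R):
    """Solve S X = R by Gauss-Jordan elimination with partial pivoting."""
    n = len(S)
    aug = [list(S[i]) + list(R[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kfac_step(acts, deltas, lr, rho_A, rho_B):
    """One damped K-FAC update for a weight matrix of shape (d_out, d_in).

    acts:   N rows of layer inputs, each of length d_in
    deltas: N rows of backpropagated errors, each of length d_out
    """
    N = len(acts)
    mean = lambda M: [[v / N for v in row] for row in M]
    A = mean(matmul(transpose(acts), acts))      # input covariance, d_in x d_in
    B = mean(matmul(transpose(deltas), deltas))  # error covariance, d_out x d_out
    G = mean(matmul(transpose(deltas), acts))    # gradient, d_out x d_in
    left = solve(add_damping(B, rho_B), G)       # (B + rho_B I)^{-1} G
    # Right-multiply by (A + rho_A I)^{-1}; the damped factor is symmetric,
    # so solving against the transpose gives the same result.
    right = transpose(solve(add_damping(A, rho_A), transpose(left)))
    return [[-lr * v for v in row] for row in right]


def width_scaled(base, width, exponent):
    """Hypothetical helper: scale a base hyperparameter as base * width**(-exponent).

    The exponent is a placeholder input, not a value taken from the paper.
    """
    return base * width ** (-exponent)
```

In this sketch, transferring hyperparameters across widths would amount to tuning the base learning rate and base damping on a narrow model and then calling width_scaled with the layer-specific exponents when the width changes. For example, kfac_step(acts, deltas, lr=width_scaled(0.1, width, c), rho_A=width_scaled(1e-3, width, e_A), rho_B=width_scaled(1e-3, width, e_B)), where c, e_A, and e_B are assumed names for the exponents the parameterization would prescribe.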
