On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width

Satoki Ishikawa; Ryo Karakida

TL;DR

This paper tackles how to scale second-order optimization to very wide neural networks by developing a μP-inspired ABC-parameterization tailored for K-FAC and Shampoo. By analyzing a one-step update in the infinite-width limit, it derives how to set random initialization, learning rates, and damping so that feature learning remains stable as width grows, and demonstrates that hyperparameters can transfer across widths. The key contributions are the μP formulations for K-FAC and Shampoo, damping-heuristic adjustments for stability, and empirical evidence that learning-rate and damping transfer improve performance on wide models. This parameterization provides a principled framework for applying second-order methods to larger models with fewer hyperparameter searches, advancing practical scalability of second-order optimization.

Abstract

Second-order optimization has been developed to accelerate the training of deep neural networks, and it is being applied to increasingly large models. In this study, with a view to training at even larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even when the network width increases significantly. Inspired by the maximal update parameterization ($\mu$P), we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters, including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer the hyperparameters across models with different widths.
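To make the idea of width-dependent scales concrete, here is a minimal sketch of an abc-style parameterization for one layer. It assumes the common convention in which the effective weight is $W = M^{-a} w$ with $w_{ij} \sim \mathcal{N}(0, M^{-2b})$ and the learning rate is $\eta = \eta' M^{-c}$ for width $M$. The function names `width_scaled` and `abc_layer`, the sign convention used for damping, and the exponent values in the usage lines are illustrative assumptions, not the values derived in the paper.

```python
import numpy as np

def width_scaled(base, width, exponent):
    """Scale a base hyperparameter as base * width**(-exponent).

    The sign convention is an assumption of this sketch; the same helper
    could be applied to a damping term with its own exponent.
    """
    return base * float(width) ** (-exponent)

def abc_layer(rng, fan_out, fan_in, width, a, b, c, base_lr):
    """Return (effective weight, per-layer learning rate) for one layer.

    W = width**(-a) * w, with w_ij ~ N(0, width**(-2*b)),
    and lr = base_lr * width**(-c).
    """
    w = rng.normal(0.0, float(width) ** (-b), size=(fan_out, fan_in))
    W = float(width) ** (-a) * w
    lr = width_scaled(base_lr, width, c)
    return W, lr

# Illustrative exponents only (hypothetical, not the paper's table).
rng = np.random.default_rng(0)
W_hidden, lr_hidden = abc_layer(rng, 512, 512, width=512,
                                a=0.0, b=0.5, c=0.0, base_lr=0.1)
```

Tuning `base_lr` (and a base damping) at a small width and reusing it at a larger width is what hyperparameter transfer amounts to in this picture; the exponents carry all of the width dependence.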

Paper Structure (63 sections, 2 theorems, 86 equations, 26 figures, 6 tables)

Introduction
Related Work
Preliminaries
Overview of second-order optimization
ABC-parameterization
Preferable scaling of HPs in Second-order optimization
$\mu$P for second-order optimization
Justification of damping heuristics
Implicit bias of K-FAC towards NNGP at zero initialization
Experiments
$\mu$P in wide neural networks
Learning Rate transfer
Damping Transfer
Conclusion
Limitation and future direction.
...and 48 more sections

Key Result

Proposition 4.1

Consider the first one-step update of K-FAC and Shampoo in the infinite-width limit. The second-order optimization becomes valid for [equation not preserved in this extract]. It admits the $\mu$P for feature learning at [equation not preserved in this extract], where we set $a_l=0$; setting $e_A=e_B=e$ corresponds to Shampoo.
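The exponents $e_A$ and $e_B$ in the proposition suggest a single update rule that covers both optimizers. The following is a minimal sketch under that assumption: the step is taken to be $\Delta W = -\eta\,(B + \rho_B I)^{-e_B}\, G\, (A + \rho_A I)^{-e_A}$, with K-FAC factors built from layer inputs and backpropagated gradients, and Shampoo factors built from the gradient itself. The function names and the exact form of the update are assumptions for illustration and may differ from the paper's definitions.

```python
import numpy as np

def damped_inverse_power(C, damping, exponent):
    """Compute (C + damping * I)^(-exponent) for a symmetric PSD matrix C."""
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, 0.0, None) + damping  # guard tiny negative eigenvalues
    return (evecs * evals ** (-exponent)) @ evecs.T

def preconditioned_step(G, A, B, lr, rho_A, rho_B, e_A, e_B):
    """Return -lr * (B + rho_B I)^(-e_B) @ G @ (A + rho_A I)^(-e_A).

    G has shape (fan_out, fan_in), A is (fan_in, fan_in), B is (fan_out, fan_out).
    """
    left = damped_inverse_power(B, rho_B, e_B)
    right = damped_inverse_power(A, rho_A, e_A)
    return -lr * left @ G @ right

def kfac_factors(H, D):
    """K-FAC-style factors from inputs H (n, fan_in) and output gradients D (n, fan_out)."""
    n = H.shape[0]
    return H.T @ H / n, D.T @ D / n

def shampoo_factors(G):
    """Shampoo-style factors (G^T G, G G^T) from a gradient G of shape (fan_out, fan_in)."""
    return G.T @ G, G @ G.T

# Usage on random data (all values hypothetical).
rng = np.random.default_rng(0)
H = rng.normal(size=(64, 5))
D = rng.normal(size=(64, 3))
G = D.T @ H / 64

A, B = kfac_factors(H, D)
kfac_step = preconditioned_step(G, A, B, lr=0.1, rho_A=1e-3, rho_B=1e-3, e_A=1.0, e_B=1.0)

# 0.25 is the exponent of standard matrix Shampoo; the paper's choice of e may differ.
A_s, B_s = shampoo_factors(G)
shampoo_step = preconditioned_step(G, A_s, B_s, lr=0.1, rho_A=1e-3, rho_B=1e-3, e_A=0.25, e_B=0.25)
```

Setting both exponents to zero recovers plain gradient descent, which makes this form convenient for comparing the width scaling of first-order and second-order updates in one code path.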

Figures (26)

Figure 1: $\mu$P achieves feature learning across the width. In SP (PyTorch's default), $\Delta h_l$ in each layer depends on the width. For K-FAC, the default damping setting (heuristics) does not satisfy the $\mu$P condition, and the rescaled damping must be used instead, as explained in Section 4.2. We train a 3-layer MLP on CIFAR10 in the first row and Myrtle-5 on CIFAR10 in the second row. This result does not depend on exponential moving averages or the activation function (Appendix D).
Figure 2: The obtained parameterization is consistent throughout training. The order of the curvature matrix in K-FAC does not change with time. The input layer is proportional to $M$ whereas the output layer is proportional to $1/M$, which is a natural order in terms of the $\mu$P of SGD. We trained a 3-layer MLP on the FashionMNIST dataset.
Figure 3: K-FAC converges to the NNGP solution when the variance of the last layer is close to zero. When $b_L$ in the last layer is increased (with the other parameters fixed to $\mu$P), K-FAC can converge to the NNGP solution in one step. Increasing $b_L$ therefore drives training to a kernel solution, which limits the range of $b_L$ in which feature learning can occur.
Figure 4: Wider models learn well under $\mu$P throughout training. Using $\mu$P, training proceeds equally across widths, and the loss is lower for larger widths throughout training. (Left) We trained CBOW on WikiText2 by Shampoo with various widths. (Right) We trained ResNet18 on CIFAR100 by K-FAC while increasing the number of channels from 1 to 16.
Figure 5: $\mu$P consistently achieves higher accuracy for various learning rates. (ResNet50 on ImageNet)
...and 21 more figures
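The one-step behaviour described in the Figure 3 caption can be illustrated on a toy problem. The sketch below assumes an MSE loss, a frozen random hidden layer, a last layer initialized at zero, and an output-side factor taken to be the identity; all of these are simplifications chosen for this example, not the paper's experimental setup. Under these assumptions, a single damped, input-preconditioned step with unit learning rate equals kernel ridge regression with the empirical feature kernel $K = H H^\top$, whose infinite-width counterpart is the NNGP kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 32, 8, 256
rho = 1e-3  # damping (hypothetical value)

X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))

# Frozen random ReLU features; only the last layer is trained in this sketch.
W1 = rng.normal(size=(d, width)) / np.sqrt(d)
H = np.maximum(X @ W1, 0.0) / np.sqrt(width)  # shape (n, width)

# Last layer at zero initialization, loss = ||H w - y||^2 / (2 n).
w0 = np.zeros((width, 1))
grad = H.T @ (H @ w0 - y) / n
A = H.T @ H / n  # input-side curvature factor

# One preconditioned step with learning rate 1.
w_one_step = w0 - np.linalg.solve(A + rho * np.eye(width), grad)

# Kernel ridge regression written in terms of the feature kernel K = H H^T.
K = H @ H.T
w_kernel = H.T @ np.linalg.solve(K + n * rho * np.eye(n), y)

def mse(w):
    """Half mean squared error of the last-layer predictions."""
    return float(np.mean((H @ w - y) ** 2) / 2)

loss_before, loss_after = mse(w0), mse(w_one_step)
```

The two weight vectors agree by the push-through identity $(H^\top H + n\rho I)^{-1} H^\top = H^\top (H H^\top + n\rho I)^{-1}$, so the hidden features never need to move for the training loss to drop, which is the kernel-like behaviour the caption contrasts with feature learning.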

Theorems & Definitions (3)

Proposition 4.1: $\mu$P of second-order parameterization
Definition A.6: Valid second-order optimization
Lemma A.7: e.g., Petersen and Pedersen (2008), The Matrix Cookbook
