Meta-Principled Family of Hyperparameter Scaling Strategies

Sho Yaida

TL;DR

To address principled scaling of hyperparameters as networks grow, the paper unifies neural-tangent (NT) and maximal-update (MU) scaling into a one-parameter meta-family indexed by $s \in [0,1]$. By enforcing forward criticality and learning-rate equivalence, it derives explicit $p_{\ell}$, $q_{\ell}$, $r$ scalings and identifies the emergent representation-learning scale $\gamma = L/n^{1-s}$, where $L$ is the depth and $n$ the width, which governs the leading changes in the NTK and its differentials during training. The authors analyze the dynamics of the NTK, dNTK, and ddNTKs to show how representation learning persists or evolves under different depth-width regimes, delineating a web of finite-width effective theories that connects the traditional infinite-width limits. These results give principled guidance for scaling depth with width so that large-scale models keep their representation-learning capability.
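As a rough illustration of the scale quoted above, the following minimal sketch evaluates $\gamma = L/n^{1-s}$ and inverts it to find the depth that keeps $\gamma$ fixed as the width grows. The function names and the example values are hypothetical and chosen only for illustration; they do not come from the paper, and the sketch ignores any order-one prefactors.

```python
def representation_learning_scale(depth: float, width: float, s: float) -> float:
    """Return gamma = L / n**(1 - s) for depth L, width n, and s in [0, 1].

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return depth / width ** (1.0 - s)


def depth_for_fixed_gamma(width: float, s: float, gamma: float) -> float:
    """Return the depth L = gamma * n**(1 - s) that holds gamma constant.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return gamma * width ** (1.0 - s)


if __name__ == "__main__":
    # Illustrative values only: how the depth must grow with width
    # to keep gamma = 0.5 at an intermediate s = 0.5.
    for n in (100, 400, 1600):
        print(n, depth_for_fixed_gamma(n, s=0.5, gamma=0.5))
```

At $s = 0$ the expression reduces to $L/n$ and at $s = 1$ to $L$, so holding $\gamma$ fixed means depth grows linearly with width at one end of the family and stays constant at the other.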

Abstract

In this note, we first derive a one-parameter family of hyperparameter scaling strategies that interpolates between the neural-tangent scaling and mean-field/maximal-update scaling. We then calculate the scalings of dynamical observables -- network outputs, neural tangent kernels, and differentials of neural tangent kernels -- for wide and deep neural networks. These calculations in turn reveal a proper way to scale depth with width such that resultant large-scale models maintain their representation-learning ability. Finally, we observe that various infinite-width limits examined in the literature correspond to the distinct corners of the interconnected web spanned by effective theories for finite-width neural networks, with their training dynamics ranging from being weakly-coupled to being strongly-coupled.

Paper Structure (13 sections, 45 equations)

Introduction
Review and $p_{\ell}q_{\ell}r$ Scaling Strategies
Deriving a Meta-Principled Family
The Principle of Criticality at Meta
The Principle of Learning-Rate Equivalence at Meta
Meta-Principled Family
Computing the Degree of Representation Learning
dNTK
ddNTKs
Casting the Web of Effective Theories
Peeking into the Hierarchical Structure
Type-I Higher-Order Differentials
General Types of Higher-Order Differentials
