Meta-Principled Family of Hyperparameter Scaling Strategies

Sho Yaida

TL;DR

To scale hyperparameters in a principled way as networks grow, the paper unifies neural-tangent (NT) and maximal-update (MU) scaling into a one-parameter meta-family indexed by $s \in [0, 1]$. By enforcing forward criticality and gradient equivalence, it derives explicit scalings for $p_l$, $q_l$, and $r$, and identifies the emergent representation-learning scale $\gamma = L/n^{1-s}$, which governs the leading changes in the NTK and its differentials during training. The author analyzes the dynamics of the NTK, dNTK, and ddNTK to show how representation learning persists or evolves under different depth-width regimes, delineating a web of finite-width theories that connects the traditional infinite-width limits. These results give principled guidance on how to scale depth with width so that large-scale models preserve their representation-learning capability.

Abstract

In this note, we first derive a one-parameter family of hyperparameter scaling strategies that interpolates between the neural-tangent scaling and mean-field/maximal-update scaling. We then calculate the scalings of dynamical observables -- network outputs, neural tangent kernels, and differentials of neural tangent kernels -- for wide and deep neural networks. These calculations in turn reveal a proper way to scale depth with width such that resultant large-scale models maintain their representation-learning ability. Finally, we observe that various infinite-width limits examined in the literature correspond to the distinct corners of the interconnected web spanned by effective theories for finite-width neural networks, with their training dynamics ranging from being weakly-coupled to being strongly-coupled.
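
As a rough illustration of the depth-width relation, the sketch below evaluates the scale $\gamma = L/n^{1-s}$ quoted in the TL;DR and inverts it to get the depth that keeps $\gamma$ at a chosen value as the width grows. The function names are hypothetical. Treating "fixed $\gamma$" as the criterion for maintaining representation learning is an assumption made here for illustration, not a statement of the paper's exact prescription.

```python
def representation_learning_scale(depth: float, width: float, s: float) -> float:
    """Evaluate gamma = L / n^(1 - s) for depth L, width n, and meta-parameter s.

    Hypothetical helper: only the formula itself is taken from the TL;DR.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    if depth <= 0 or width <= 0:
        raise ValueError("depth and width must be positive")
    return depth / width ** (1.0 - s)


def depth_for_fixed_gamma(width: float, s: float, gamma: float) -> float:
    """Invert gamma = L / n^(1 - s) for L, i.e. L = gamma * n^(1 - s).

    Assumption: holding gamma fixed is the scaling target of interest.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return gamma * width ** (1.0 - s)


if __name__ == "__main__":
    # Show how the depth needed for a fixed gamma changes with width and s.
    for s in (0.0, 0.5, 1.0):
        for width in (100, 10_000):
            depth = depth_for_fixed_gamma(width, s, gamma=0.1)
            print(f"s={s:.1f}  n={width:>6}  L={depth:.2f}")
```

At $s = 0$ the formula reduces to $\gamma = L/n$, so depth must grow linearly with width to hold $\gamma$ fixed; at $s = 1$ it reduces to $\gamma = L$, so the required depth is independent of width.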

Paper Structure (13 sections, 45 equations)