Principled Architecture-aware Scaling of Hyperparameters

Wuyang Chen; Junru Wu; Zhangyang Wang; Boris Hanin

TL;DR

This work addresses how hyperparameters should adapt to neural architecture, challenging the common practice of architecture-agnostic tuning. It develops topology-aware initialization and μP-based (maximal update) learning-rate scaling for DAG networks. The optimal learning rate is derived to scale with the graph's path structure as $\eta^* \simeq c (\sum_{p=1}^P L_p^3)^{-1/2}$, where $P$ is the number of input-to-output paths and $L_p$ is the depth of path $p$, and the initialization should satisfy $C^{(\ell',\ell)}=\frac{2}{d_{\text{in}}^{(\ell')}}$ to preserve information flow. The framework extends to CNNs via a kernel-size factor $q$, yielding $\eta^* \simeq c (\sum_{p=1}^P L_p^3)^{-1/2} q^{-1}$, and is validated on MLPs, CNNs, and NAS benchmarks. Empirically, the authors show that architecture-aware scaling can significantly improve accuracy and even reorder network rankings in NAS benchmarks, which suggests that AutoML comparisons should be revisited with these training adjustments. Overall, the work highlights that principled, architecture-aware hyperparameters can improve training stability and fairness in architecture evaluation, with practical implications for NAS and beyond.
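The learning-rate rule above can be sketched as a transfer from a reference architecture whose maximal learning rate is already known, which cancels the unknown constant $c$. This is a minimal sketch, not the authors' implementation: the function names `depth_factor` and `transfer_lr` are hypothetical, the per-path depths $L_p$ are assumed to be supplied by the caller, and $q$ is treated simply as the kernel-size factor appearing in the formula.

```python
import math


def depth_factor(path_depths):
    """Return (sum_p L_p^3)^(-1/2) for a list of per-path depths L_p."""
    total = sum(depth ** 3 for depth in path_depths)
    if total <= 0:
        raise ValueError("need at least one path with positive depth")
    return total ** -0.5


def transfer_lr(base_lr, base_paths, target_paths, base_q=1.0, target_q=1.0):
    """Scale a known learning rate from a reference network to a target one.

    Assumes eta* ~ c * (sum_p L_p^3)^(-1/2) * q^(-1) with the same constant c
    for both networks, so c cancels in the ratio. For MLPs, leave q at 1.
    """
    if base_q <= 0 or target_q <= 0:
        raise ValueError("kernel-size factors must be positive")
    depth_ratio = depth_factor(target_paths) / depth_factor(base_paths)
    kernel_ratio = base_q / target_q
    return base_lr * depth_ratio * kernel_ratio


# Hypothetical example: a single-path network of depth 1 was tuned by grid
# search to 0.1; estimate the rate for a DAG with two paths of depth 2 and 3.
reference_lr = 0.1
two_branch_lr = transfer_lr(reference_lr, base_paths=[1], target_paths=[2, 3])
print(f"estimated learning rate: {two_branch_lr:.4f}")
```

Working with ratios mirrors how the figures describe the experiments: one architecture is tuned by grid search and the result is rescaled to the others.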

Abstract

Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize hyperparameters or design principles for them, such that they generalize to diverse unseen scenarios. However, most designs or optimization methods are agnostic to the choice of network structure, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture, which includes the network depth, width, convolutional kernel size, and connectivity patterns. By requiring every parameter to be maximally updated with the same mean squared change in pre-activations, we can generalize our initialization and learning rates across MLPs (multi-layer perceptrons) and CNNs (convolutional neural networks) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings can be easily changed by better training the networks in benchmarks with our architecture-aware learning rates and initialization.
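For the initialization side, the TL;DR condition $C^{(\ell',\ell)} = 2/d_{\text{in}}^{(\ell')}$ can be turned into a per-edge weight standard deviation. The sketch below rests on two assumptions that this summary does not spell out: that $d_{\text{in}}$ counts the incoming edges at the relevant vertex, and that weights are drawn with variance $C$ divided by the layer's fan-in. The helper name `dag_init_std` is hypothetical.

```python
import math


def dag_init_std(fan_in, in_degree):
    """Standard deviation for one edge's weights under C = 2 / in_degree.

    Assumption: weights have variance C / fan_in, so a vertex with a single
    incoming edge recovers the usual 2 / fan_in variance used with ReLU.
    """
    if fan_in <= 0 or in_degree <= 0:
        raise ValueError("fan_in and in_degree must be positive")
    variance_constant = 2.0 / in_degree
    return math.sqrt(variance_constant / fan_in)


# Hypothetical example: width-128 layers feeding a vertex with 1 vs. 2 inputs.
single_input_std = dag_init_std(fan_in=128, in_degree=1)
double_input_std = dag_init_std(fan_in=128, in_degree=2)
print(single_input_std, double_input_std)
```

Dividing by the in-degree keeps the summed pre-activation variance at a vertex comparable regardless of how many edges feed into it, which is one reading of "preserve information flow".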

Paper Structure (30 sections, 1 theorem, 48 equations, 9 figures)

Introduction
Related Works
Hyperparameter Optimization
Hyperparameter Transfer
Benchmarking Neural Architecture Search
Methods
Definition of DAG Networks
Topology-aware Initialization Scheme
Initialization scaling for DAG.
Topology-aware Learning Rates
Learning rate scaling in DAG.
Learning Rates in DAG Network with Convolutional Layers
Setting.
Learning rate scaling in CNNs.
Implications on Architecture Design
...and 15 more sections

Key Result

Lemma B.1

For $\ell=1,\ldots, L$, we have where

Figures (9)

Figure 1: A neural network's architecture can be represented by a directed acyclic graph (DAG). $x$ is the input, $z^{(1)}, z^{(2)}, \cdots, z^{(L)}$ are pre-activations (vertices), and $z^{(L+1)}$ is the output. $W$ is the layer operation (edge). Our DAG space includes architectures of different connections and layer types ("Linear + ReLU", "Skip-connect", and "Zero"). For example, by disabling some edges (gray) or replacing them with the identity $\bm{I}$ (skip connection, dashed arrow), a DAG can represent practical networks such as MLP and ResNet \cite{he2016deep}.
Figure 2: MLPs with different topologies (graph structures). The x-axis shows the "ground truth" maximal learning rates found by grid search. The y-axis shows the learning rates estimated by our principle (equation \ref{eq:lr_dag}). The red line indicates the identity. Starting from the "ground truth" maximal learning rate of the basic MLP with $L=1$, we scale both learning rates and initialization to diverse architectures. The radius of a dot indicates the variance over three random runs. Data: CIFAR-10.
Figure 3: CNNs with different graph structures and kernel sizes. The x-axis shows the "ground truth" maximal learning rates found by grid search. The y-axis shows the learning rates estimated by our principle (equation \ref{eq:lr_cnn}). The red line indicates the identity. Starting from the "ground truth" maximal learning rate of the CNN with $L=1$ and kernel size $3$, we scale both learning rates and initialization to diverse architectures. The radius of a dot indicates the variance over three random runs. Data: CIFAR-10.
Figure 4: We apply our scaling principles to existing architecture benchmarks. Left column: better accuracy. For different network architectures, we plot the accuracy obtained with our scaling principles (y-axis, "Re-scaled") against the accuracy obtained with a fixed training recipe (x-axis, "NAS-Bench-201"). Each dot represents a unique architecture with different layer types and topologies (Figure 1 in \cite{dong2020bench}). Our principle (blue dots) achieves better accuracy than the benchmark (red line $y = x$). Middle column: network rankings in benchmarks are fragile. We compare networks' performance rankings at different top $K \%$ percentiles ($K = 100, 90, \cdots, 10, 5, 1$; bottom-right dots correspond to networks at the top right of the left column), trained by our method vs. the benchmark, and find that better networks are ranked more differently from the benchmark. This indicates that current network rankings in benchmarks (widely used to compare NAS algorithms) can be easily broken simply by training the networks better. Right column: less distinguishable architectures. We plot the pairwise performance gaps between different architectures, and find that our principle makes networks more similar and less distinguishable in terms of accuracy.
Figure 5: MLP networks of different depths on CIFAR-10. The x-axis shows the "ground truth" maximal learning rates found by grid search. The y-axis shows the learning rates estimated by our principle (equation \ref{eq:lr_dag}). The red line indicates the identity. Starting from the true maximal learning rate of the feedforward network with $L=3$, we scale up to $L=10$. The radius of a dot indicates the variance over three random runs.
...and 4 more figures

Theorems & Definitions (4)

Proof: Derivations for § \ref{lem:dag_init}
Lemma B.1: Adapted from Lemma 2.1 in \cite{jelassi2023learning}
Proof
Proof: Derivations for § \ref{thm:lr_cnn}
