$μ$pscaling small models: Principled warm starts and hyperparameter transfer

Yuxin Ma, Nan Chen, Mateo Díaz, Soufiane Hayou, Dmitriy Kunisky, Soledad Villar

TL;DR

This work tackles efficient multi-size model deployment by principled width upscaling, enabling warm-starts for larger models while preserving training dynamics. It builds a theory of static and dynamic equivalence across widths within a unified Tensor Program/Ne$\otimes$or$\top$ framework and links widening to $\mu$P, yielding a practical upscaling algorithm that injects width-aware noise and enables near-zero-shot hyperparameter transfer. By analyzing the infinite-width limit with Tensor Programs, the authors characterize how upscaled training behaves and how hyperparameters transfer across widths. Empirically, they show faster convergence and competitive performance across MLPs, ResNets, and GPT-2, while also highlighting architecture-dependent limitations and the need for careful noise and hyperparameter tuning during upscaling.

Abstract

Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller ones in order to transfer knowledge and accelerate convergence. However, this method can be sensitive to hyperparameters that need to be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether the most common workaround -- tuning on smaller models and extrapolating via hyperparameter scaling laws -- is still sound when using upscaling. We address this with principled approaches to upscaling with respect to model widths and efficiently tuning hyperparameters in this setting. First, motivated by $μ$P and any-dimensional architectures, we introduce a general upscaling method applicable to a broad range of architectures and optimizers, backed by theory guaranteeing that models are equivalent to their widened versions and allowing for rigorous analysis of infinite-width limits. Second, we extend the theory of $μ$Transfer to a hyperparameter transfer technique for models upscaled using our method and empirically demonstrate that this method is effective on realistic datasets and architectures.

Paper Structure

This paper contains 75 sections, 10 theorems, 132 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Proposition 2.1

Consider a base MLP with weight matrices $(W^{(\ell)} \in \mathbb{R}^{n_{\ell} \times n_{\ell-1}})_{\ell=1}^{L}$. Construct a widened MLP that uses the same activation function and preserves the input and output dimensions, with weights obtained by duplicating and rescaling those of the base MLP [display equation lost in extraction], where $N_\ell = k_\ell n_\ell$ for all $\ell$ and the width multipliers $k_\ell \in \mathbb{N}$ sat…
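The statement above is truncated and its display equation is missing, so the paper's exact rescaling is not reproduced here. As a minimal sketch of the duplicate-and-rescale idea, assume each widened weight matrix is a $k_\ell \times k_{\ell-1}$ grid of copies of $W^{(\ell)}$ divided by $k_{\ell-1}$, with $k_0 = k_L = 1$ so that input and output dimensions are preserved. The names `widen_mlp` and `mlp_forward`, the `tanh` activation, and the bias-free layers are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def widen_mlp(weights, multipliers):
    """Duplicate-and-rescale widening under the scaling assumed in this sketch.

    weights[i] is the matrix of layer i+1, with shape (n_{i+1}, n_i).
    multipliers = [k_0, ..., k_L]; k_0 = k_L = 1 keeps input/output sizes fixed.
    """
    assert len(multipliers) == len(weights) + 1
    assert multipliers[0] == 1 and multipliers[-1] == 1
    widened = []
    for W, k_in, k_out in zip(weights, multipliers[:-1], multipliers[1:]):
        # A k_out x k_in grid of copies of W; dividing by k_in cancels the
        # k_in-fold sum over duplicated input coordinates.
        widened.append(np.kron(np.ones((k_out, k_in)), W) / k_in)
    return widened

def mlp_forward(weights, x, act=np.tanh):
    """Bias-free MLP: activation after every layer except the last."""
    h = x
    for W in weights[:-1]:
        h = act(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
dims = [3, 4, 5, 2]  # n_0, n_1, n_2, n_3 (hypothetical sizes)
base = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
wide = widen_mlp(base, [1, 2, 3, 1])  # hidden widths become 8 and 15

x = rng.standard_normal(dims[0])
y_base = mlp_forward(base, x)
y_wide = mlp_forward(wide, x)
max_gap = float(np.max(np.abs(y_base - y_wide)))  # floating-point error only
```

Under this assumed scaling, every hidden activation of the widened network is a tiled copy of the corresponding base activation, which is why the two outputs agree up to rounding for any elementwise activation.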

Figures (6)

  • Figure 1: Illustration of upscaling method and hyperparameter transfer method.
  • Figure 2: Training (top row) and validation (bottom row) performance for MLP, ResNet, and GPT-2. The y-axes are truncated to highlight differences between the two curves in each panel. For the MLP and ResNet experiments, which exhibit training instability, plots show the mean over five random runs, with shaded min--max bands. More details and additional results are deferred to Appendix \ref{appen:exp}.
  • Figure 3: Hyperparameter transfer for the upscaled model. Columns (left to right): GPT-2 with AdamW, MLP with SGD, and MLP with AdamW. For MLP experiments, curves report the mean across five runs, with min--max ranges across random seeds. In (f), more widths are evaluated than in (c) because of the slightly noisy behavior at $N=1024$.
  • Figure 4: Training and validation curves comparing upscaling to training from scratch for an MLP trained with SGD and weight decay. All models have width $k n = 2000$. Curves show the mean across five random runs, with ranges spanning the minimum to the maximum across runs. The y-axes are truncated to highlight differences between the two curves.
  • Figure 5: Training and validation curves comparing upscaling to training from scratch for an MLP trained with AdamW. All models have width $k n = 2000$. Curves show the mean across five random runs, with ranges spanning the minimum to the maximum across runs. The y-axes are truncated to highlight differences between the two curves.
  • ...and 1 more figure

Theorems & Definitions (28)

  • Proposition 2.1: Static equivalence of MLPs
  • proof
  • Proposition 2.2: Dynamic equivalence of MLPs trained with SGD
  • proof
  • Definition 2.3: Entrywise optimizer with weight decay
  • Proposition 2.4: Dynamic equivalence of MLPs trained with general optimizers
  • Theorem 2.5: Informal
  • proof: Proof of Proposition \ref{prop:entrywise_update}
  • Example C.1: SGD with and without momentum
  • Example C.2: Adam and AdamW
  • ...and 18 more
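
The dynamic-equivalence results listed above (Proposition 2.2 for SGD) are only named here, not stated. The following is a hedged numerical sketch of what such an equivalence can look like, not the paper's formulation: for a two-layer, bias-free `tanh` MLP with squared loss, widened by an assumed duplicate-and-rescale rule, plain SGD keeps the widened model equal to the widened version of the trained base model when layer $\ell$ uses learning rate $\eta\, k_\ell / k_{\ell-1}$. That learning-rate rule is derived for this sketch's assumed scaling, and all names and sizes are hypothetical.

```python
import numpy as np

def forward(W1, W2, x):
    """Two-layer bias-free MLP; returns hidden activations and output."""
    h = np.tanh(W1 @ x)
    return h, W2 @ h

def sgd_step(W1, W2, x, y, lr1, lr2):
    """One SGD step on 0.5 * ||out - y||^2 with per-layer learning rates."""
    h, out = forward(W1, W2, x)
    err = out - y                                        # dLoss/d(out)
    grad_W2 = np.outer(err, h)
    grad_W1 = np.outer((W2.T @ err) * (1.0 - h**2), x)   # tanh' = 1 - tanh^2
    return W1 - lr1 * grad_W1, W2 - lr2 * grad_W2

rng = np.random.default_rng(1)
n_in, n, n_out, k, lr = 3, 4, 2, 5, 0.1  # hypothetical sizes and step size
W1 = rng.standard_normal((n, n_in))
W2 = rng.standard_normal((n_out, n))
x, y = rng.standard_normal(n_in), rng.standard_normal(n_out)

# Assumed widening: stack k copies of W1; tile W2 k times and divide by k.
W1_up = np.kron(np.ones((k, 1)), W1)
W2_up = np.kron(np.ones((1, k)), W2) / k

for _ in range(3):
    W1, W2 = sgd_step(W1, W2, x, y, lr, lr)
    # Layer 1 has k_1/k_0 = k, layer 2 has k_2/k_1 = 1/k.
    W1_up, W2_up = sgd_step(W1_up, W2_up, x, y, lr * k, lr / k)

out_base = forward(W1, W2, x)[1]
out_up = forward(W1_up, W2_up, x)[1]
```

In this sketch the duplicated units receive identical gradients, so they stay exact copies throughout training. That symmetry is one plausible reading of why the summary mentions injecting noise during upscaling, although the noise scheme itself is not described in this excerpt.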