$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky

TL;DR

This work tackles the meta-generalization gap of learned optimizers (LOs) by deriving a width-aware Maximal Update Parametrization ($\mu$P) for LOs and proposing a simple meta-training recipe for the resulting $\mu$LOs. The authors derive $\mu$P for two LO architectures (VeLO and small\_fc\_lopt) and show that, under LLN-based scaling, updates stay stable rather than diverging as width grows. Empirically, $\mu$LOs outperform standard-parametrization (SP) LOs on unseen, wider tasks and are competitive with or better than per-task tuned AdamW and $\mu$Adam baselines. They also generalize better than SP LOs to deeper networks ($5\times$ meta-training) and to much longer training horizons ($25\times$ meta-training). The gains hold across several task families and architectures and are obtained with the same meta-training compute budget as the SP baselines.

Abstract

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $\mu$-parameterized LOs ($\mu$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that $\mu$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ meta-training) and surprising generalization to much longer training horizons ($25\times$ meta-training) when compared to SP LOs.

Paper Structure (48 sections, 6 theorems, 28 equations, 12 figures, 10 tables)

Key Result

Proposition 4.1

Assume that the learned optimizer $f_\phi$ has the form $small\_fc\_lopt$, is fed with the features given in Appendix apdx:sec:smallfc, and that during training the optimizee's parameters and input data become aligned, leading to Law of Large Numbers (LLN) scaling. Then the update, initialization, and pre...
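
The LLN scaling invoked in the proposition can be illustrated numerically. The sketch below is not taken from the paper: it assumes standard Gaussian vectors, and the helper names (`aligned_sum`, `independent_sum`, `scaling_table`) are hypothetical. It contrasts a sum of $n$ aligned (same-sign) terms, which grows like $n$, with a sum of $n$ independent zero-mean terms, which grows like $\sqrt{n}$. This is the usual reason aligned quantities need a $1/n$ multiplier, rather than $1/\sqrt{n}$, to stay $\Theta(1)$ as width grows.

```python
import math
import random


def aligned_sum(n, rng):
    """Inner product of a Gaussian vector with itself: all n terms are >= 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return sum(xi * xi for xi in x)


def independent_sum(n, rng):
    """Inner product of two independent Gaussian vectors: zero-mean terms."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return sum(xi * yi for xi, yi in zip(x, y))


def scaling_table(widths, trials=200, seed=0):
    """Return (n, aligned mean / n, independent RMS / sqrt(n)) for each width n."""
    rng = random.Random(seed)
    rows = []
    for n in widths:
        # Aligned case (LLN regime): the mean of the sum, divided by n.
        lln = sum(aligned_sum(n, rng) for _ in range(trials)) / trials / n
        # Independent case (CLT regime): the root mean square, divided by sqrt(n).
        mean_sq = sum(independent_sum(n, rng) ** 2 for _ in range(trials)) / trials
        clt = math.sqrt(mean_sq) / math.sqrt(n)
        rows.append((n, lln, clt))
    return rows


if __name__ == "__main__":
    for n, lln, clt in scaling_table([64, 256, 1024]):
        print(f"n={n:5d}  aligned/n={lln:.3f}  independent/sqrt(n)={clt:.3f}")
```

Both normalized columns should stay close to 1 at every width, which shows that each regime has its own width-independent normalization.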

Figures (12)

  • Figure 1: Meta-generalization is severely limited without our approach. Subfigure (a) illustrates meta-generalization axes by distinguishing between meta-training tasks used herein (blue) and out-of-distribution tasks (red). Subfigure (b) reports the average rank across tasks within our evaluation suite that are out-of-distribution with respect to the corresponding axis. Both AdamW and $\mu$Adam undergo task-specific hyperparameter tuning across more than $500$ configurations per task. Learned Optimizers of the same architecture are meta-learned on the same tasks with a FLOP-matched budget.
  • Figure 2: Layer 2 pre-activations behave harmoniously in $\mu$P for $\mu$LOs and $\mu$Adam alike. We report the evolution of coordinate-wise standard deviation of the difference between the initial ($t=0$) and $t$-th second-layer pre-activations of an MLP during training for the first $500$ steps of a single run (the remaining layers behave similarly, see Sec. \ref{sec:apdx:activations}). We observe that all models parameterized in $\mu$P enjoy stable coordinates across widths, while the pre-activations of larger-width models in SP blow up after a number of training steps.
  • Figure 3: Generalization beyond meta-training widths is severely limited without our approach. Each point is the average final training loss over $5$ seeds with standard error bars. Subfigures (a) and (b) report the results of our meta-training task ablation on the ImageNet-32 meta-training tasks at 1000 and 5000 steps. Subfigures (c) and (d) report the performance of $\mu$LO$_M$ and $\mu$VeLO$_M$ on OOD datasets.
  • Figure 4: Evaluating generalization to wider networks for different tasks. All optimizers are meta-trained or hyperparameter tuned for $1000$ inner steps (dotted red line); therefore, any optimization beyond $1000$ steps is considered out-of-distribution. We plot average training loss over $5$ seeds with standard error bars. We observe that $\mu$LO$_M$ and $\mu$VeLO$_M$ generalize smoothly to longer unrolls and all unseen tasks, unlike their SP counterparts, which diverge or fail to make progress. $\mu$LOs outperform the extensively tuned AdamW and $\mu$Adam baselines in subfigures (a) and (b), match or surpass them in subfigure (c), and exceed or nearly match their performance on far out-of-distribution LM and ViT tasks (subfigures (d) and (e)). Note that all AdamW and $\mu$Adam baselines are tuned on smaller versions of each task, while our $\mu$LOs are only meta-trained on MLP tasks.
  • Figure 5: Evaluating generalization capabilities of $\mu$LOs to deeper networks. Our focus is on comparing the meta-generalization to deeper tasks of $\mu$LOs to SP LOs (all meta-trained exclusively on MLPs). We also report the performance of per-task tuned AdamW and $\mu$Adam for reference. Each plot reports average training loss over $5$ seeds with standard error bars. In each case, $\mu$LOs show improved generalization and performance when compared to their SP counterparts.
  • ...and 7 more figures
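
The quantity tracked in Figure 2 can be sketched in a few lines. This is one plausible reading of the caption, not the authors' code: it assumes pre-activations are available as batch-by-width lists of floats and takes the standard deviation over all entries of the difference $h_t - h_0$. The function name `coordinate_std_of_change` is hypothetical.

```python
import statistics


def coordinate_std_of_change(preacts_init, preacts_t):
    """Std over all (example, coordinate) entries of h_t - h_0.

    Both arguments are lists of rows (batch x width) of floats holding the
    pre-activations of one layer at step 0 and at step t.
    """
    if len(preacts_init) != len(preacts_t):
        raise ValueError("batch sizes differ")
    deltas = []
    for row0, rowt in zip(preacts_init, preacts_t):
        if len(row0) != len(rowt):
            raise ValueError("widths differ")
        deltas.extend(ht - h0 for h0, ht in zip(row0, rowt))
    # Population standard deviation across every coordinate of the change.
    return statistics.pstdev(deltas)


if __name__ == "__main__":
    h0 = [[0.0, 0.0], [0.0, 0.0]]
    ht = [[1.0, -1.0], [1.0, -1.0]]
    print(coordinate_std_of_change(h0, ht))
```

Logging this value at each training step for several widths gives a coordinate check of the kind the caption describes: roughly flat curves across widths indicate stable coordinates, while curves that grow with width indicate blow-up.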

Theorems & Definitions (12)

  • Proposition 4.1: $small\_fc\_lopt$ $\mu$P
  • Proposition 4.2: VeLO $\mu$P
  • Definition A.1
  • Proposition A.2
  • proof
  • Corollary A.3
  • proof
  • Proposition A.4
  • proof
  • Corollary A.5
  • ...and 2 more