$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky
TL;DR
This work tackles the meta-generalization gap of learned optimizers (LOs) by deriving a width-aware Maximal Update Parametrization ($\mu$P) for LOs and proposing a compute-efficient meta-training recipe for the resulting $\mu$LOs. The authors derive $\mu$P for two LO architectures (VeLO and small\_fc\_lopt), proving that scaling based on the law of large numbers (LLN) keeps updates stable and non-divergent as width grows. Empirically, $\mu$LOs outperform standard-parametrization (SP) LOs and hand-tuned baselines on unseen, wider tasks, and they unexpectedly also generalize to deeper networks and to much longer training horizons. The improvements hold across multiple task families and architectures, suggesting a practical route to scalable, generalizable LO meta-training at the same compute budget as SP baselines.
Abstract
Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to \emph{meta-generalize}, that is, to optimize unseen tasks, especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($μ$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $μ$-parameterized LOs ($μ$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also observe empirically that, compared to SP LOs, $μ$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ the meta-training depth) and to much longer training horizons ($25\times$ the meta-training horizon).
