Celo2: Towards Learned Optimization Free Lunch

Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky

TL;DR

By crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise.

Abstract

Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption because they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B), which is six orders of magnitude larger than the tasks in its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. Overall, this work paves the way for practically applicable learnable optimization algorithms, unlocking exploration of richer meta-training and data curation recipes to further improve performance.
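To make the "optimization harness" named in the abstract concrete, here is a minimal NumPy sketch of how its three ingredients could fit together: orthogonalization of hidden-weight updates, a separate path for all other parameters, and decoupled weight decay. It is an illustration under stated assumptions, not the paper's implementation. The names `placeholder_rule`, `harness_step`, and `hidden_names` are hypothetical, the learned update rule is replaced by RMS-normalized momentum as a stand-in, and exact SVD-based orthogonalization is used for simplicity where a practical harness may use a cheaper approximation.

```python
import numpy as np


def orthogonalize(update):
    """Map a 2D update to the nearest semi-orthogonal matrix, U @ Vt from its SVD."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt


def placeholder_rule(grad, momentum, beta=0.9):
    """Hypothetical stand-in for a learned rule: momentum scaled to unit RMS."""
    momentum = beta * momentum + (1.0 - beta) * grad
    rms = np.sqrt(np.mean(momentum ** 2)) + 1e-8
    return momentum / rms, momentum


def harness_step(params, grads, state, hidden_names, lr=1e-3, weight_decay=0.01):
    """One optimizer step over a dict of parameters (all names are illustrative)."""
    new_params, new_state = {}, {}
    for name, p in params.items():
        direction, new_state[name] = placeholder_rule(grads[name], state[name])
        # Assumption: only 2D hidden weights get an orthogonalized update;
        # input/output weights and 1D parameters keep the plain normalized update.
        if name in hidden_names and p.ndim == 2:
            direction = orthogonalize(direction)
        # Decoupled weight decay: shrink the weights directly instead of
        # adding a penalty term to the gradient.
        new_params[name] = p - lr * direction - lr * weight_decay * p
    return new_params, new_state


# Tiny usage example with made-up shapes.
rng = np.random.default_rng(0)
params = {"w_hidden": np.ones((3, 2)), "bias": np.ones(2)}
grads = {"w_hidden": rng.normal(size=(3, 2)), "bias": np.ones(2)}
state = {k: np.zeros_like(v) for k, v in params.items()}
params_next, state_next = harness_step(params, grads, state, hidden_names={"w_hidden"})
```

Because both paths produce a normalized direction, the step size is set by `lr` rather than by the raw gradient magnitude, which is the property the abstract's "normalized optimizer architecture" phrase points at.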

Paper Structure (13 sections, 8 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 2: ImageNet classification with ViT. We test our learned update rule, Celo2, on the ImageNet classification task with batch size 512 and 50K steps, which is 25$\times$ longer than its meta-training unroll length and 30,000$\times$ larger than the tasks seen during meta-training. Since VeLO is trained with final loss as the meta-objective, it shows non-trivial dynamics during training in order to achieve low final loss (see norm plots). Celo2 reaches VeLO's final loss in $\sim$50% of the steps. Once test accuracy reaches $\sim$66%, all optimizers start overfitting on this task; this is consistent with findings in prior work dahl2023benchmarking. Since our update rule is normalized, it shows training norm dynamics consistent with AdamW. Moreover, VeLO is meta-trained with 200K unroll length on a large number of diverse tasks, including ViTs and the ImageNet dataset, whereas Celo2 is meta-trained only on small image MLP tasks (§\ref{sec:exps}), which highlights its strong meta-generalization capability.
  • Figure 3: Reinforcement Learning. We directly evaluate our learned optimizer, Celo2, on Atari RL tasks using the PPO algorithm to learn the RL policy. Our results indicate that Celo2 performs on par with a well-tuned AdamW baseline on these out-of-distribution tasks, while the VeLO baseline stagnates at a much lower return. The latter result is corroborated by Figure 11 in metz2022velo.
  • Figure 4: As shown in the figure on the left, Celo2-base, which uses a simple learned MLP rule for all parameters without orthogonalization or AdamW, scales stably on the GPT-2 task. We find that both techniques, (1) orthogonalization and (2) Adam for 1D params, improve performance when applied on top of Celo2-base. Applying these two techniques directly at test time improves performance, but meta-training with them is even better.
  • Figure 5: Validation loss curves for Celo2 with various learning rates sampled uniformly on a log scale between 1e-5 and 1e-3 on the LM (30M) FineWeb-Edu dataset.
  • Figure 6: Parallel coordinate plot showing Celo2 hyperparameter sensitivity on the SpaceInvaders RL task. Yellow curves indicate high-return configurations.
  • ...and 2 more figures
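
The Figure 5 caption describes learning rates sampled uniformly on a log scale between 1e-5 and 1e-3. A minimal sketch of that sampling scheme follows. The function name `sample_log_uniform`, the seed, and the number of samples are illustrative assumptions; the source does not say how many learning rates were drawn.

```python
import math
import random


def sample_log_uniform(low, high, n, seed=0):
    """Draw n values whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    rng = random.Random(seed)
    log_low, log_high = math.log10(low), math.log10(high)
    return [10.0 ** rng.uniform(log_low, log_high) for _ in range(n)]


# Hypothetical sweep: 8 learning rates in the range quoted in the caption.
learning_rates = sorted(sample_log_uniform(1e-5, 1e-3, n=8))
```

Sampling in log space gives each decade (1e-5 to 1e-4, 1e-4 to 1e-3) equal probability mass, whereas sampling uniformly in linear space would place about 90% of the draws in the top decade.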