Narrowing the Focus: Learned Optimizers for Pretrained Models

Gus Kristiansen; Mark Sandler; Andrey Zhmoginov; Nolan Miller; Anirudh Goyal; Jihwan Lee; Max Vladymyrov

TL;DR

This work proposes a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers, effectively adapting its strategy to the specific model and dataset.

Abstract

In modern deep learning, models are trained by applying gradient updates through an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed, and tuning their hyperparameters is a large part of the training process. Learned optimizers have shown some initial promise, but are generally unsuccessful as a general optimization mechanism applicable to every problem. In this work we explore a different direction: instead of learning general optimizers, we specialize them to a specific training environment. We propose a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers, effectively adapting its strategy to the specific model and dataset. When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam and existing general learned optimizers. Moreover, it demonstrates robust generalization with respect to model initialization, evaluation on unseen datasets, and training durations beyond its meta-training horizon.
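To make the idea of a layer-specific linear combination of base-optimizer directions concrete, the following is a minimal sketch rather than the authors' implementation. The function names (`combine_directions`, `apply_layerwise`) are hypothetical, and two details are assumptions not stated in the abstract: each base direction is normalized before mixing, and the mixed direction is rescaled to a per-layer update norm and subtracted from the parameters. In the paper the mixing weights and update norm are produced by a learned model; here they are set by hand.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def normalize(vec, eps=1e-12):
    """Scale vec to (approximately) unit norm; eps guards against division by zero."""
    n = l2_norm(vec)
    return [x / (n + eps) for x in vec]


def combine_directions(directions, mix_weights, update_norm):
    """Mix base-optimizer directions for ONE layer (assumed scheme).

    directions:  one flat direction per base optimizer (e.g. Adam, SGD)
    mix_weights: one weight per direction (hand-set here, learned in the paper)
    update_norm: desired norm of the final update for this layer
    """
    unit_dirs = [normalize(d) for d in directions]
    mixed = [
        sum(w * d[i] for w, d in zip(mix_weights, unit_dirs))
        for i in range(len(unit_dirs[0]))
    ]
    # Rescale so the step has exactly the requested norm.
    return [update_norm * x for x in normalize(mixed)]


def apply_layerwise(params, directions, mix_weights, update_norms):
    """Apply an independent mix and step size to every layer (dicts keyed by layer name)."""
    new_params = {}
    for name, theta in params.items():
        step = combine_directions(directions[name], mix_weights[name], update_norms[name])
        # Assumption: directions point uphill (gradient-like), so we subtract the step.
        new_params[name] = [t - s for t, s in zip(theta, step)]
    return new_params
```

For example, with two base directions for a layer, `mix_weights = [1.0, 0.0]` follows only the first direction, `[0.0, 1.0]` only the second, and intermediate values interpolate between them while `update_norm` fixes the step length.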

Paper Structure (26 sections, 2 equations, 10 figures, 16 tables)

This paper contains 26 sections, 2 equations, 10 figures, 16 tables.

Introduction
Preliminaries
Motivation and Related Work
L3RS: Learned Layer-wise Learning Rate Scheduler
Model architecture
Adaptive Exponential Moving Averages (EMA).
Time features.
Embedding features.
Base-optimizer direction magnitude.
Comparison with VeLO
Experiments
Fine-Tuning.
Model Choices.
Task Distribution and Meta-Training.
Meta-Evaluation and Baselines.
...and 11 more sections

Figures (10)

Figure 1: Left: Inner loop evaluation. Given a task $T := \left \{\theta_0, \{D^t_K\}, D^e \right \}$, the learned optimizer $f_{\psi}$ uses optimization statistics $\Phi$ to update model parameters starting from $\theta_0$, taking one step for each of the training batches $\{D^t_K\}$. After $K$ update steps, the final model parameters are evaluated on the evaluation set $D^e$ using meta-loss $L_\psi(\theta_K, D^e)$. Right: Outer loop NES meta-training iteration. Given a task $T$ and meta-parameters $\psi_t$, Gaussian noise is added to $\psi_t$ to produce a number of candidates equal to the population size $c$, $(\psi_{t,0}, \ldots, \psi_{t,c})$. An inner loop evaluation is performed on the given task for all candidates. The fitness of each candidate is then used to perform an NES update step, resulting in the next learned optimizer parameters, $\psi_{t+1}$.
Figure 2: Left: L3RS applied to a single layer of the target model. The MLP receives time features, EMA features of the global loss, layer gradient norm and layer parameter norm, the target layer embedding, and the norm of each direction provided. The MLP outputs the weighting for each direction ($\mu_p$) as well as the final update norm ($\lambda$). Right: L3RS is applied at every layer of the network independently with the same MLP weights but different layer embeddings.
Figure 3: Meta-evaluation of L3RS meta-trained on ImageNet25 for $10$ to $500$ steps along with various benchmarks. Performance is compared to VeLO (left), Adam with cosine learning rate (center), and Adam with a constant learning rate (right). Each marker represents model evaluation at that number of steps. Solid lines indicate the number of steps for in-distribution evaluation, while dashed lines indicate generalization to more steps than meta-training. A. In-domain Generalization. Both initialization and evaluation are on ImageNet. B. Out-of-Domain Initialization. Initialized on ImageNet, evaluated on Places25Eval. C. Out-of-Domain Evaluation. Initialized on Places, evaluated on ImageNet25Eval. D. Out-of-Domain Init & Eval. Both initialization and evaluation are on the Places dataset. E. Random Initialization. Random initialization, evaluated on the ImageNet25Eval dataset. F. Speedup of L3RS in iterations. For in-domain generalization, this shows how much faster L3RS achieves a given accuracy compared to the baselines.
Figure 4: Visualization of learned mixing coefficients $\mu^{(l)}$ and per-layer learning rates $\lambda^{(l)}$ over 100 steps for a ResNet-34 model. Each layer's type and component are distinguished by color and line type (see legend). The general trend shows curves moving up and to the left, indicating a transition from Adam ($\mu^{(l)} = 0$) to SGD ($\mu^{(l)} = 1$) and a decrease in $\lambda^{(l)}$. The initial step is marked with a $\bullet$ and the final step with a $\star$.
Figure 5: Average learning rate vs. direction mix between Adam and SGD for each of the 100 steps of the L3RS optimizer.
...and 5 more figures
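
The inner/outer loop described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' training code: the task is a hypothetical one-dimensional quadratic, the "learned optimizer" is reduced to a single learned step size `psi[0]`, and antithetic (mirrored) noise sampling is an assumed variance-reduction choice that the caption does not mention. The names `inner_loop` and `nes_step` and all default hyperparameters are illustrative only.

```python
import random


def inner_loop(psi, task, num_steps):
    """Toy inner loop: run K updates driven by psi and return the meta-loss.

    The model is a single scalar theta with loss (theta - target)^2,
    standing in for training on batches and scoring on an evaluation set.
    """
    theta = task["theta0"]
    for _ in range(num_steps):
        grad = 2.0 * (theta - task["target"])
        theta = theta - psi[0] * grad  # psi[0] acts as a learned step size
    return (theta - task["target"]) ** 2


def nes_step(psi, task, num_steps, population=16, sigma=0.01, meta_lr=0.01, rng=random):
    """One NES outer-loop update: perturb psi, evaluate candidates, move psi."""
    pairs = population // 2
    grad_est = [0.0] * len(psi)
    for _ in range(pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in psi]
        plus = [p + sigma * e for p, e in zip(psi, eps)]
        minus = [p - sigma * e for p, e in zip(psi, eps)]
        # Fitness of each candidate is its meta-loss after a full inner loop.
        diff = inner_loop(plus, task, num_steps) - inner_loop(minus, task, num_steps)
        for i, e in enumerate(eps):
            grad_est[i] += diff * e / (2.0 * sigma)
    # Descend the estimated meta-gradient to obtain the next meta-parameters.
    return [p - meta_lr * g / pairs for p, g in zip(psi, grad_est)]
```

A usage sketch: starting from `psi = [0.1]` on `task = {"theta0": 1.0, "target": 0.0}`, repeated calls to `nes_step(psi, task, num_steps=5)` move the step size toward values that give a lower meta-loss after five inner steps, without ever backpropagating through the inner loop.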