Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization

Jakub Kopal; Michal Gregor; Santiago de Leon-Martinez; Jakub Simko

TL;DR

Overshoot is a momentum-based optimization method that computes gradients at overshot weights to exploit future gradient information. It unifies classical momentum (CM), Nesterov's accelerated gradient (NAG), and vanilla SGD, and provides efficient SGDO and AdamO implementations with zero memory overhead. Across a diverse set of tasks, Overshoot accelerates convergence and improves final generalization, though the optimal overshoot factor $\gamma$ is task-dependent. Limitations include limited theoretical guarantees and the need for adaptive $\gamma$ strategies in practice.

Abstract

Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov's momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at the current model weights before taking a step and updating the model. Rather than calculating the gradient at the current model weights, Overshoot calculates the gradient at model weights shifted in the direction of the current momentum. This sacrifices the immediate benefit of using the gradient w.r.t. the exact current model weights in favor of evaluating at a point that will likely be more relevant for future updates. We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence (saving on average at least 15% of steps). Overshoot consistently outperforms both standard and Nesterov's momentum across a wide range of tasks and integrates into popular momentum-based optimizers with zero memory overhead and small computational overhead.
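
The idea in the abstract can be illustrated with a minimal sketch. This is not the paper's SGDO or AdamO implementation. It assumes a plain update of the form $m_t = \mu m_{t-1} + g_t$, $\theta_t = \theta_{t-1} - \eta m_t$, with the gradient $g_t$ evaluated at overshot weights $\theta'_{t-1} = \theta_{t-1} - \eta\gamma m_{t-1}$. The names `overshoot_sgd` and `quad_grad`, the toy quadratic objective, and the step-size values are all hypothetical choices made for illustration.

```python
import numpy as np

def overshoot_sgd(grad_fn, theta0, lr=0.05, mu=0.9, gamma=3.0, steps=100):
    """Illustrative momentum SGD whose gradient is taken at overshot weights.

    Assumed update (a sketch, not the authors' efficient implementation):
        g      = grad_fn(theta_over)
        m      = mu * m + g
        theta  = theta - lr * m
        theta_over = theta - lr * gamma * m
    """
    theta = np.asarray(theta0, dtype=float).copy()  # base weights
    theta_over = theta.copy()                       # overshot weights
    m = np.zeros_like(theta)                        # momentum buffer
    for _ in range(steps):
        g = grad_fn(theta_over)             # gradient at the overshot point
        m = mu * m + g                      # aggregate with past gradients
        theta = theta - lr * m              # step the base weights
        theta_over = theta - lr * gamma * m # extend the last update by gamma
    return theta

def quad_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2).
    return np.array([1.0, 10.0]) * x

theta_cm = overshoot_sgd(quad_grad, [1.0, 1.0], gamma=0.0)   # classical momentum
theta_nag = overshoot_sgd(quad_grad, [1.0, 1.0], gamma=0.9)  # gamma == mu
```

Under this assumed form, `gamma=0.0` evaluates the gradient at the base weights and so reduces to classical momentum, while `gamma == mu` reproduces the look-ahead point of Nesterov's momentum. Larger values of `gamma` evaluate the gradient further along the current momentum direction.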

Paper Structure (18 sections, 18 equations, 4 figures, 3 tables, 3 algorithms)

Introduction
Method
Efficient implementation for SGD
Efficient implementation for Adam
Related Work
Overshoot Properties
Momentum Unification
Gradients relevance
Gradient weight decay
Experiments
Hyper-parameters
Tasks
Results
Training loss convergence
Generalization (model performance)
...and 3 more sections

Figures (4)

Figure 1: Overshoot derives gradients from overshot model weights $\theta^{\prime}$ instead of from base weights $\theta$. The overshot weights are estimates of "future model weights", computed by extending previous model updates by a factor of $\gamma$. This way, past gradients become more relevant to the current model weights, which leads to faster convergence. Consider the situation at $\theta_{t+4}$: computing the next step will use gradients coming from a more representative "neighborhood" group of overshot models (red circles) instead of a less representative "tail" of past base models (gray points).
Figure 2: Overshoot for various $\gamma$ and $\mu$ settings. Negative momentum aggregates past gradients with an inverted sign. Arguably, this is not the intended behavior of a momentum-based optimizer. *Estimated by minimizing Eq. \ref{eq:distance_to_min}, using SGDO to generate a series of paths with 30,000 steps and randomly sampled gradients from $\{g \in \mathbb{R}^{20} : \|g\| = 1\}$.
Figure 3: The average training and test losses, computed over 10 runs with different random seeds. Training losses are smoothed using a one-dimensional Gaussian filter. Losses are obtained using the base model weights: $\theta_t - \gamma\hat{\theta}_t$. We employ a shifted logarithmic y-axis scale to visually separate small absolute differences.
Figure 4: The relation between $Awd$ (Eq. \ref{eq:distance_to_min}) and training loss (AUC) is analyzed using AdamO for $\gamma \in \{0, 1, \dots, 15\}$ and $\beta_1 \in \{0.9, 0.95\}$. The training loss is visualized using a colorbar that is specific to each subgraph (lower is better). Note that AdamO with $\gamma = 0$ corresponds to the vanilla Adam optimizer. The $Awd$ is estimated by considering the distance to the past 50 model weights, sampled at every 50th training step. Training loss is computed based on the base model weights: $\theta_t - \gamma\hat{\theta}_t$. For $\beta_1=0.9$: $\mathop{\mathrm{arg\,min}}\limits_{\gamma} Awd(\gamma) \approx 2.5$, and for $\beta_1=0.95$: $\mathop{\mathrm{arg\,min}}\limits_{\gamma} Awd(\gamma) \approx 5$ across the tasks.
