Careful with that Scalpel: Improving Gradient Surgery with an EMA

Yu-Guan Hsieh; James Thornton; Eugene Ndiaye; Michal Klein; Marco Cuturi; Pierre Ablin

TL;DR

This paper introduces Bloop, a gradient-surgery method derived from a bilevel formulation. It resolves conflicts between a primary loss $L_{\mathrm{main}}$ and an auxiliary loss $L_{\mathrm{aux}}$, with gradients $g_{\mathrm{main}}$ and $g_{\mathrm{aux}}$, by updating along $d = g_{\mathrm{main}} + \lambda \pi(g_{\mathrm{aux}}; g_{\mathrm{main}})$, where $\pi(g_{\mathrm{aux}}; g_{\mathrm{main}})$ is the component of $g_{\mathrm{aux}}$ orthogonal to $g_{\mathrm{main}}$. To handle stochastic optimization, it stabilizes the projection with an exponential moving average $g_{\mathrm{main}}^{\mathrm{EMA}}$ of the main-loss gradients, forming $d^{\mathrm{batch}} = g^{\mathrm{batch}}_{\mathrm{main}} + \lambda \pi(g^{\mathrm{batch}}_{\mathrm{aux}}; g_{\mathrm{main}}^{\mathrm{EMA}})$, so that descent on $L_{\mathrm{main}}$ is preserved in expectation. Theoretical results connect the method to the simple bilevel problem, proving approximate stationarity in the full-batch setting and convergence of the stochastic version under standard assumptions, with the EMA playing a critical role. Empirically, Bloop achieves better Pareto fronts across bias-imposition, multi-task, and joint-dataset experiments in NLP and vision than baselines such as Mixed, Dynamic Barrier, and PCGrad, and the EMA component is essential for these gains.
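
As a worked illustration of why this direction preserves descent on $L_{\mathrm{main}}$, the projection can be written out explicitly. The closed form below is the standard formula for the component of one vector orthogonal to another; the step size $\eta$ and the smoothness of $L_{\mathrm{main}}$ are assumptions introduced for this sketch and are not quoted from the summary.

```latex
\begin{align*}
\pi(g_{\mathrm{aux}}; g_{\mathrm{main}})
  &= g_{\mathrm{aux}}
   - \frac{\langle g_{\mathrm{aux}}, g_{\mathrm{main}}\rangle}{\|g_{\mathrm{main}}\|^{2}}\, g_{\mathrm{main}}, \\
\langle g_{\mathrm{main}}, \pi(g_{\mathrm{aux}}; g_{\mathrm{main}})\rangle
  &= \langle g_{\mathrm{main}}, g_{\mathrm{aux}}\rangle
   - \langle g_{\mathrm{aux}}, g_{\mathrm{main}}\rangle = 0, \\
\langle g_{\mathrm{main}}, d\rangle
  &= \|g_{\mathrm{main}}\|^{2}, \\
L_{\mathrm{main}}(\theta - \eta d)
  &= L_{\mathrm{main}}(\theta) - \eta\, \|g_{\mathrm{main}}\|^{2} + O(\eta^{2}).
\end{align*}
```

Note that $\pi$ is linear in its first argument but not in its second. Averaging over noisy auxiliary gradients is therefore harmless, whereas placing a noisy mini-batch estimate of $g_{\mathrm{main}}$ in the second slot biases the projection, which is plausibly the gap the EMA is meant to close.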

Abstract

Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of the model (e.g., performance on another dataset, robustness, agreement with a prior). Although the simplest approach to incorporating an auxiliary loss is to sum it with the training loss as a regularizer, recent works have shown that one can improve performance by blending the gradients beyond a simple sum; this is known as gradient surgery. We cast the problem as a constrained minimization problem where the auxiliary objective is minimized among the set of minimizers of the training loss. To solve this bilevel problem, we follow a parameter update direction that combines the training loss gradient and the projection of the auxiliary gradient orthogonal to the training gradient. In a setting where gradients come from mini-batches, we explain how, using a moving average of the training loss gradients, we can carefully maintain this critical orthogonality property. We demonstrate that our method, Bloop, can lead to much better performance on NLP and vision experiments than other gradient surgery methods without EMA.
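
The mini-batch update described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the EMA rule `ema <- (1 - rho) * ema + rho * g_main_batch`, the names `bloop_step`, `rho`, `lam`, and `lr`, the fallback when the EMA is numerically zero, and the toy quadratic losses are all hypothetical choices made for the example.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(g, ref, eps=1e-12):
    """Component of g orthogonal to ref; returns g unchanged if ref is ~0."""
    sq_norm = dot(ref, ref)
    if sq_norm < eps:  # assumption: skip the projection when ref vanishes
        return list(g)
    coef = dot(g, ref) / sq_norm
    return [a - coef * r for a, r in zip(g, ref)]


def bloop_step(theta, g_main_batch, g_aux_batch, g_main_ema,
               lam=1.0, lr=0.1, rho=0.1):
    """One stochastic update; returns (new_theta, new_ema, direction)."""
    # Assumed EMA form: ema <- (1 - rho) * ema + rho * g_main_batch.
    g_main_ema = [(1.0 - rho) * e + rho * g
                  for e, g in zip(g_main_ema, g_main_batch)]
    # Project the auxiliary batch gradient against the EMA, not the noisy batch.
    proj = project_orthogonal(g_aux_batch, g_main_ema)
    # d_batch = g_main_batch + lam * pi(g_aux_batch; g_main_ema)
    direction = [g + lam * p for g, p in zip(g_main_batch, proj)]
    new_theta = [t - lr * d for t, d in zip(theta, direction)]
    return new_theta, g_main_ema, direction


# Toy usage (hypothetical losses): L_main = 0.5 * theta[0]**2 and
# L_aux = 0.5 * ||theta - (1, 1)||**2, with exact gradients for simplicity.
theta, ema = [2.0, -1.0], [0.0, 0.0]
for _ in range(60):
    g_main = [theta[0], 0.0]
    g_aux = [theta[0] - 1.0, theta[1] - 1.0]
    theta, ema, _ = bloop_step(theta, g_main, g_aux, ema)
```

In this toy problem the minimizers of the main loss are the points with `theta[0] == 0`, and among them the auxiliary loss is smallest at `theta[1] == 1`, so the iterates approach `(0, 1)`. Summing the two losses with equal weight would instead settle at `theta[0] == 0.5`, trading main-loss performance for the auxiliary objective.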

Paper Structure (26 sections, 2 theorems, 46 equations, 10 figures, 1 table, 1 algorithm)

Introduction
The Bloop Algorithm
Full-batch setting and main intuition
Stochastic extension for large-scale problems
Extension to multi-level hierarchical optimization
Theoretical Analysis
Approximate stationary points of Bloop
Convergence of stochastic Bloop
Conditioning compared to regularization method
Related Works
Experiments
Baselines and evaluation
Imposing an explicit bias during training
Multi-task learning
Joint training on two datasets
...and 11 more sections

Key Result

Proposition 1

If $d$ in eq:direction is such that $\|d\|\leq\varepsilon$, then $\|g_{\mathrm{main}}\|\leq \varepsilon$. Moreover, if assum:local_error_bound holds, the Hessian of $L_{\mathrm{main}}$ is $M$-Lipschitz, and $\varepsilon$ is small enough, then there exists $v\in\mathbb{R}^p$ such that [...] Conversely, given a point $\theta^*$ that satisfies the first order optimality conditions of eq:simple_bi
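
The first claim of the proposition follows from a one-line computation, sketched here under the assumption that eq:direction is the full-batch direction $d = g_{\mathrm{main}} + \lambda \pi(g_{\mathrm{aux}}; g_{\mathrm{main}})$ given in the TL;DR. Because the projected term is orthogonal to $g_{\mathrm{main}}$, the cross term vanishes:

```latex
\begin{align*}
\|d\|^{2}
  &= \|g_{\mathrm{main}}\|^{2}
   + 2\lambda\, \langle g_{\mathrm{main}}, \pi(g_{\mathrm{aux}}; g_{\mathrm{main}})\rangle
   + \lambda^{2}\, \|\pi(g_{\mathrm{aux}}; g_{\mathrm{main}})\|^{2} \\
  &= \|g_{\mathrm{main}}\|^{2}
   + \lambda^{2}\, \|\pi(g_{\mathrm{aux}}; g_{\mathrm{main}})\|^{2}
  \;\geq\; \|g_{\mathrm{main}}\|^{2}.
\end{align*}
```

Hence $\|g_{\mathrm{main}}\| \leq \|d\| \leq \varepsilon$, and the same identity also gives $\lambda \|\pi(g_{\mathrm{aux}}; g_{\mathrm{main}})\| \leq \varepsilon$.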

Figures (10)

Figure 1: Principle of the Bloop method: the direction we follow is the sum of the gradient of the main loss $g_{\mathrm{main}}$, and of the projection of the gradient of the auxiliary loss, orthogonal to $g_{\mathrm{main}}$. This enforces that, at the first order, following this direction yields the same decrease in $L_{{\mathrm{main}}}$ as following $g_{\mathrm{main}}$.
Figure 2: Effect of randomness on the projection: We fix the dimension of the parameter space to $p=100$, and draw both $g_{\text{main}}$ and $g_{\text{aux}}$ from the Gaussian distribution $\mathcal{N}(\mathbf{0}, I)$. These two vectors are fixed in the remainder of the experiment. We draw $g_{\text{main}}^{\text{batch}} \sim g_{\text{main}} + \sigma \mathcal{N}(\mathbf{0}, I)$ and use Monte-Carlo simulation to estimate $\mathbb{E}[d_{\text{simple}}^{\text{batch}}] =g_{\text{main}} + \mathbb{E}[\pi(g_{\text{aux}}; g_{\text{main}}^{\text{batch}})]$. We compare its value against $d_{\text{bloop}}=g_{\text{main}} + \pi(g_{\text{aux}}; g_{\text{main}})$, its theoretical value when $\sigma=0$ (the target direction), and $d_{\text{mixed}}=g_{\text{main}} + (1-1/100) g_{\text{aux}}$, its theoretical value when $\sigma$ tends to infinity. We see that $\mathbb{E}[d^{\text{batch}}_{\text{simple}}]$ moves closer to the direction of the mixed method as the noise starts to dominate.
Figure 3: Trade-offs between the main and the auxiliary objectives in problems where the auxiliary loss is used to impose an explicit bias on the neural network. The symbols correspond to the parameters reached at the end of training and form a Pareto front, the transparent curves are the training trajectories. Bloop achieves a better trade-off than the other methods, which all perform similarly here.
Figure 4: Trade-off between the performances in the Cifar10Mnist multi-task learning problem. Bloop gives a better Pareto front.
Figure 5: Trade-offs between the main and the auxiliary objectives in natural language processing experiments with transformer models, where the main loss is the loss over a large dataset and the auxiliary loss is a loss over a small dataset that can be overfitted easily. We observe that Bloop gets a significantly better Pareto front than all other methods, which perform similarly to the mixed method. Bloop's optimization gains on the training losses transfer to the evaluation losses.
...and 5 more figures
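
The Monte-Carlo experiment described in the Figure 2 caption can be reproduced approximately with the standard library alone. The sketch below follows the caption's setup ($p=100$, Gaussian $g_{\text{main}}$ and $g_{\text{aux}}$, additive noise of scale $\sigma$); the function names, the seed, the sample count, and the list of $\sigma$ values are hypothetical choices, and the printed distances are not the paper's reported numbers.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(g, ref):
    """Component of g orthogonal to ref (g itself if ref is exactly zero)."""
    sq_norm = dot(ref, ref)
    if sq_norm == 0.0:
        return list(g)
    coef = dot(g, ref) / sq_norm
    return [a - coef * r for a, r in zip(g, ref)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def expected_simple_direction(g_main, g_aux, sigma, n_samples, rng):
    """Monte-Carlo estimate of g_main + E[pi(g_aux; g_main_batch)]."""
    acc = [0.0] * len(g_main)
    for _ in range(n_samples):
        # Noisy mini-batch estimate of the main gradient.
        g_batch = [g + sigma * rng.gauss(0.0, 1.0) for g in g_main]
        proj = project_orthogonal(g_aux, g_batch)
        acc = [a + q for a, q in zip(acc, proj)]
    return [g + a / n_samples for g, a in zip(g_main, acc)]


def reference_directions(g_main, g_aux):
    """Return (d_bloop, d_mixed) as defined in the caption."""
    p = len(g_main)
    proj = project_orthogonal(g_aux, g_main)
    d_bloop = [m + q for m, q in zip(g_main, proj)]
    d_mixed = [m + (1.0 - 1.0 / p) * a for m, a in zip(g_main, g_aux)]
    return d_bloop, d_mixed


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for repeatability only
    p = 100
    g_main = [rng.gauss(0.0, 1.0) for _ in range(p)]
    g_aux = [rng.gauss(0.0, 1.0) for _ in range(p)]
    d_bloop, d_mixed = reference_directions(g_main, g_aux)
    for sigma in (0.0, 0.1, 1.0, 10.0, 100.0):
        est = expected_simple_direction(g_main, g_aux, sigma, 500, rng)
        print(sigma, distance(est, d_bloop), distance(est, d_mixed))
```

The factor $1 - 1/p$ in `d_mixed` comes from the fact that, for a pure isotropic noise vector in dimension $p$, the expected outer product of its unit direction with itself is $I/p$, so on average the projection removes a $1/p$ fraction of $g_{\text{aux}}$.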

Theorems & Definitions (2)

Proposition 1: Stationary points
Theorem 2: Convergence of Bloop
