Gradient Deconfliction via Orthogonal Projections onto Subspaces For Multi-task Learning

Shijie Zhu, Hui Zhao, Tianshu Wu, Pengjie Wang, Hongbo Deng, Jian Xu, Bo Zheng

TL;DR

This paper addresses gradient conflicts in multi-task learning by introducing GradOPS, an orthogonal-projection method that enforces strong non-conflicting gradients. By projecting each task gradient $g_i$ onto the subspace orthogonal to the span of the others, GradOPS guarantees a final update $G'$ that does not conflict with any original task gradient, enabling simple, flexible trade-offs via a single hyperparameter $\alpha$ and reweighting with $w_i$. The authors provide convergence guarantees to Pareto stationary points and demonstrate state-of-the-art performance across diverse benchmarks, including multi-task classification, scene understanding, and large-scale recommendation, with the ability to discover multiple Pareto-optimal trade-offs. GradOPS also outperforms or matches existing MOO methods while being simpler and more robust to task order, suggesting strong non-conflicting gradients as a practical foundation for robust, versatile MTL. Overall, the work offers a scalable, principled approach to balancing competing tasks and enables practitioners to tailor trade-offs without extensive hyperparameter sweeps.

Abstract

Although multi-task learning (MTL) is a preferred approach that has been applied successfully in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks, mainly because of the negative effects of conflicting gradients among the tasks. In this paper, we examine the influence of conflicting gradients and emphasize the importance and advantages of achieving non-conflicting gradients, which allow simple but effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS), where each subspace is spanned by the other task-specific gradients. Our method not only resolves all conflicts among the tasks but can also effectively search for diverse solutions that reflect different trade-off preferences. A theoretical analysis of convergence is provided, and the algorithm is evaluated on multiple benchmarks in various domains. Results demonstrate that our method can effectively find multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.
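To make "conflicting gradients" concrete, here is a minimal sketch. It assumes the common convention that two task gradients conflict when their inner product is negative; the excerpt does not spell out the paper's exact criterion. The function name `conflict_matrix` and the toy vectors are illustrative and not taken from the paper.

```python
import numpy as np

def conflict_matrix(grads):
    """Return a boolean (T, T) matrix whose entry (i, j) is True when
    task gradients i and j have a negative inner product.

    Assumption: "conflict" means <g_i, g_j> < 0. This is a common
    convention, not a definition quoted from the paper.
    """
    G = np.asarray(grads, dtype=float)   # shape (T, d): one row per task
    gram = G @ G.T                       # all pairwise inner products
    return gram < 0.0

if __name__ == "__main__":
    # Toy flattened gradients for three hypothetical tasks.
    grads = [[1.0, 0.0], [-1.0, 1.0], [0.0, 1.0]]
    print(conflict_matrix(grads))
```

The diagonal is never flagged because each gradient has a non-negative inner product with itself.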

Paper Structure

This paper contains 28 sections, 4 theorems, 5 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Assume the individual loss functions $\mathcal{L}_1, \mathcal{L}_2, \ldots, \mathcal{L}_T$ are differentiable. Suppose the gradient of $\mathcal{L}$ is $L$-Lipschitz with $L>0$. Then, with update step size $t < \frac{2}{TL}$, GradOPS as described in Section 3.1 converges to a Pareto stationary point.
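For readers unfamiliar with the term, the standard notion of Pareto stationarity can be written as below. This is the usual textbook definition, given here as an assumption about what the theorem refers to. The symbol $\theta$ for the shared parameters and the coefficients $\lambda_i$ do not appear in the excerpt and are introduced only for illustration.

```latex
\theta^{*} \text{ is Pareto stationary}
\iff
\exists\, \lambda_1, \ldots, \lambda_T \ge 0,\ \sum_{i=1}^{T} \lambda_i = 1,
\quad \text{such that} \quad
\sum_{i=1}^{T} \lambda_i \, \nabla_{\theta} \mathcal{L}_i(\theta^{*}) = 0 .
```

Intuitively, at such a point no single direction decreases every task loss to first order.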

Figures (8)

  • Figure 1: Illustrative example of gradient conflicts in a three-task learning problem using gradient descent (GD), PCGrad and GradOPS. Task-specific gradients are labeled $g_1$, $g_2$ and $g_3$. The aggregated gradient $G$ or $G'$ in (a), (b) and (c) conflicts with the original gradient $g_3$, $g_2$ and $g_3$, respectively, which degrades the performance of the corresponding tasks. Note that different processing orders in PCGrad ([1,2,3] for (b), [3,2,1] for (c)) lead to conflicts of $G$ with different original $g_i$. In contrast, the GradOPS-modified $g_{1}'$ is orthogonal to $S={\rm span}\{g_2, g_3\}$, with the conflicting part on $S$ removed; the same holds for $g_{2}'$ and $g_{3}'$ (omitted). Thus, neither any $g_{i}'$ nor $G'$ conflicts with any of $\{g_i\}$.
  • Figure 2: Visualization of trade-offs in a 2D multi-task optimization problem. Shown are trajectories of each method from 3 different initial points (labeled with black $\bullet$) using the Adam optimizer [kingma2014adam]. Gradient descent (GD) is unable to traverse the deep valley from two of the initial points because the gradients conflict and the gradient magnitude of one task is much larger than that of the other. For MGDA-UB [MGDA18], CAGrad [CAGrad], and IMTL-G [IMTL], the final convergence point is fixed for each initial point. In contrast, GradOPS can converge to multiple points in the Pareto set by setting different $\alpha$. Experimental details are provided in the appendix.
  • Figure 3: Visualization of the update direction (in yellow) obtained by various methods on a two-task learning problem. Each update vector is rescaled to half its length for better visibility. $g_1$ and $g_2$ represent the two task-specific gradients. MGDA-UB seeks the minimum-norm convex combination of the task gradients, so its update vector is perpendicular to the dashed line. IMTL-G makes the projections of the update vector onto $\{g_1, g_2\}$ equal. PCGrad and GradOPS project each gradient onto the normal plane of the other to obtain $g'_1$ and $g'_2$. For PCGrad, the final update vector is the average of $\{g'_1, g'_2\}$. GradOPS further reweights $\{g'_1, g'_2\}$ to make trade-offs between the two tasks. As a result, the final update direction of GradOPS can vary flexibly between $g'_1$ and $g'_2$, covering the directions of MGDA-UB and IMTL-G instead of being fixed as in the other methods, and it never conflicts with either task-specific gradient.
  • Figure 4: Performance comparison between GradOPS ($\alpha=-3$) and GradOPS-static with grid-searched weights $w^{\rm static}_i$. The x-axis denotes the L2 distance between $w^{\rm static}_i$ and the average weights of GradOPS ($\alpha=-3$) over training steps.
  • Figure 5: Visualization of the loss surfaces of Figure 2.
  • ...and 3 more figures
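The projection step described in the captions of Figures 1 and 3 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. It always projects every $g_i$ onto the orthogonal complement of the span of the other gradients, whereas the paper's algorithm may apply the projection only under certain conditions. The `weights` argument stands in for the $w_i$ reweighting with arbitrary non-negative values, because the $\alpha$-dependent formula is not given in this excerpt. The names `project_out` and `deconflict` are hypothetical.

```python
import numpy as np

def project_out(g, others):
    """Remove from g its component lying in span(others).

    Least squares is used so that linearly dependent `others` are handled.
    """
    A = np.asarray(others, dtype=float).T        # shape (d, k)
    if A.size == 0:
        return np.array(g, dtype=float)
    coef, *_ = np.linalg.lstsq(A, g, rcond=None)
    return g - A @ coef                          # g minus its projection

def deconflict(grads, weights=None):
    """Return (projected gradients g_i', combined update G').

    Assumption: `weights` are arbitrary non-negative stand-ins for w_i;
    the paper's alpha-based weighting is not reproduced here.
    """
    G = np.asarray(grads, dtype=float)           # shape (T, d)
    T = G.shape[0]
    projected = np.stack(
        [project_out(G[i], np.delete(G, i, axis=0)) for i in range(T)]
    )
    w = np.ones(T) if weights is None else np.asarray(weights, dtype=float)
    update = (w[:, None] * projected).sum(axis=0)
    return projected, update

if __name__ == "__main__":
    # Toy gradients: the plain sum conflicts with g_1 (negative dot product).
    G = np.array([[1.0, 0.0, 0.0, 0.0],
                  [-3.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    _, update = deconflict(G)
    print("plain sum . g_i :", G @ G.sum(axis=0))
    print("projected . g_i :", G @ update)
```

Because each $g_i'$ is orthogonal to every $g_j$ with $j \neq i$, the inner product of the combined update with $g_j$ reduces to $w_j \lVert g_j' \rVert^2 \ge 0$ for any non-negative weights. Note that if the number of tasks is at least the parameter dimension, the other gradients can span the whole space and the projected gradients collapse to zero; the sketch is meaningful only when the dimension is much larger than the number of tasks.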

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4