Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach

Heshan Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, Tianyi Chen

TL;DR

This work develops a stochastic Multi-objective gradient Correction (MoCo) method that can guarantee convergence without increasing the batch size even in the non-convex setting and demonstrates the effectiveness of the method relative to state-of-the-art methods.

Abstract

Machine learning problems with multiple objective functions appear either in learning with multiple criteria, where learning has to make a trade-off between multiple performance metrics such as fairness, safety and accuracy; or in multi-task learning, where multiple tasks are optimized jointly, sharing inductive bias between them. These problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased noisy gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it can guarantee convergence without increasing the batch size even in the non-convex setting. Simulations on multi-task supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.
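To make the notion of a biased noisy gradient direction concrete, the following is a minimal sketch for the two-objective case, not the paper's code. It assumes the common direction is the minimum-norm convex combination of the two gradients (the MGDA-style subproblem), which has a closed form for two objectives. The function names `min_norm_weight`, `common_direction`, and `estimate_bias` are hypothetical, and the Gaussian noise model is an assumption made for illustration.

```python
import numpy as np


def min_norm_weight(g1, g2):
    """Weight lam in [0, 1] minimizing ||lam * g1 + (1 - lam) * g2||."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: every weight gives the same direction.
        return 0.5
    lam = float((g2 - g1) @ g2) / denom
    return min(1.0, max(0.0, lam))


def common_direction(g1, g2):
    """Minimum-norm convex combination of two gradients."""
    lam = min_norm_weight(g1, g2)
    return lam * g1 + (1.0 - lam) * g2


def estimate_bias(g1, g2, noise_std=1.0, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[d(noisy gradients)] - d(true gradients).

    Assumption: zero-mean Gaussian noise is added independently to each
    gradient. Because the weight depends nonlinearly on the gradients,
    the averaged noisy direction need not equal the true direction.
    """
    rng = np.random.default_rng(seed)
    true_dir = common_direction(g1, g2)
    total = np.zeros_like(true_dir)
    for _ in range(n_samples):
        n1 = rng.normal(0.0, noise_std, size=g1.shape)
        n2 = rng.normal(0.0, noise_std, size=g2.shape)
        total += common_direction(g1 + n1, g2 + n2)
    return total / n_samples - true_dir
```

Calling `estimate_bias(np.array([1.0, 0.0]), np.array([0.0, 1.0]))` returns a vector estimating how far the averaged noisy direction drifts from the noise-free one; its size depends on the noise level and on the gradients chosen.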

Paper Structure

This paper contains 26 sections, 10 theorems, 115 equations, 5 figures, 9 tables, 2 algorithms.

Key Result

Lemma 1

Define $\mathcal{F}_k$ as the $\sigma$-algebra generated by $Y_1, Y_2, \dots, Y_k$. Consider the sequences generated by eq:y update, eq:lambda update, eq:x update, eq:z update-new and eq:z update2. If we choose $T=1$ and $\eta_1=\beta_K$, under assumptions specified in Appendix app:moco-inexact, we have [...], where $\alpha_k$ and $\beta_k$ are the learning rates in updates eq:x update and eq:y update, respectively.
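The lemma refers to coupled updates for a gradient-tracking variable $y$, weights $\lambda$, and the model parameter $x$, with step sizes $\beta_k$ and $\alpha_k$. The update equations themselves are not reproduced on this page, so the sketch below is only one plausible form, written under stated assumptions: $y$ is a moving average of noisy gradients, $\lambda$ takes a projected gradient step on the simplex for a (regularized) min-norm subproblem, and $x$ moves along $Y\lambda$. The names `project_simplex` and `tracked_moo_sketch`, the constant step sizes, the $\lambda$ step size `gamma`, and the omission of any projection on $y$ and of the $z$ sequences are all assumptions.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    cond = u - (css - 1.0) / ks > 0
    rho = ks[cond][-1]
    theta = (css[rho - 1] - 1.0) / rho
    return np.maximum(v - theta, 0.0)


def tracked_moo_sketch(grad_fns, x0, steps=500, alpha=0.05, beta=0.2,
                       gamma=0.1, rho=0.0, noise_std=0.5, seed=0):
    """Hypothetical tracking-based multi-objective loop (assumed form).

    grad_fns: list of callables returning the exact gradient of each
    objective; Gaussian noise is added here to mimic stochastic gradients.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    m = len(grad_fns)
    Y = np.zeros((x.size, m))      # one tracked gradient per column
    lam = np.full(m, 1.0 / m)      # start from equal weights
    for _ in range(steps):
        # Noisy gradient samples, one column per objective.
        H = np.stack([g(x) + rng.normal(0.0, noise_std, size=x.shape)
                      for g in grad_fns], axis=1)
        # Assumed y update: move the tracker toward the new sample.
        Y_new = Y - beta * (Y - H)
        # Assumed lambda update: projected step on 0.5 * lam^T (Y^T Y + rho I) lam.
        lam_new = project_simplex(lam - gamma * (Y.T @ Y + rho * np.eye(m)) @ lam)
        # Assumed x update: step along the weighted tracked gradients.
        x = x - alpha * Y @ lam
        Y, lam = Y_new, lam_new
    return x, lam
```

With two quadratic objectives such as `lambda x: x - a` and `lambda x: x - b`, the loop can be run from any starting point; averaging noisy gradients in `Y` before forming the direction is the design choice this sketch is meant to illustrate, rather than a reproduction of the analyzed algorithm.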

Figures (5)

  • Figure 1: A toy example from liu2021conflict with two objectives (Figures \ref{fig:grad-task-1} and \ref{fig:grad-task-2}) to show the impact of gradient bias. We use the mean objective as a reference when plotting the trajectories corresponding to each initialization (3 initializations in total). The starting points of the trajectories are denoted by a black $\bullet$, and the trajectories are shown fading from red (start) to yellow (end). The Pareto front is given by the gray bar, and the black $\star$ denotes the point in the Pareto front corresponding to equal weights to each objective. We implement recent MOO algorithms such as SMG liu2021stochastic, PCGrad yu2020gradient, CAGrad liu2021conflict, and MGDA Desideri2012mgda alongside our method. Except for MGDA (Figure \ref{fig:toy-mgda}), all the other algorithms only have access to gradients of each objective with added zero-mean Gaussian noise. It can be observed that SMG, CAGrad, and PCGrad fail to find the Pareto front for some initializations.
  • Figure 2: Comparison of trajectories in the objective space. We use five initializations in the same toy example as in Figure \ref{fig:toy-comp} and plot the optimization trajectory in the objective space. MGDA converges to the Pareto front from all of the initializations. SMG, PCGrad, and CAGrad, which only have access to a single stochastic gradient per objective, fail to converge to the Pareto front for some initializations. Our MoCo follows a trajectory similar to that of MGDA and finds the Pareto front for each initialization.
  • Figure 3: Comparison of multi-gradient error
  • Figure 4: Training and test loss for the Cityscapes tasks
  • Figure 5: Training and test loss for the NYU-v2 tasks

Theorems & Definitions (22)

  • Remark 1: On the Lipschitz continuity of $\lambda^*_\rho(x)$
  • Lemma 1
  • Lemma 2
  • Remark 2: Connection between nested MOO with multi-objective actor-critic
  • Lemma 3
  • Theorem 1
  • Remark 3: Comparison with SMG liu2021stochastic
  • Theorem 2
  • Theorem 3
  • Remark 4: On the stronger assumptions
  • ...and 12 more