Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach

Heshan Fernando; Han Shen; Miao Liu; Subhajit Chaudhury; Keerthiram Murugesan; Tianyi Chen

TL;DR

This work develops a stochastic Multi-objective gradient Correction (MoCo) method that can guarantee convergence without increasing the batch size even in the non-convex setting and demonstrates the effectiveness of the method relative to state-of-the-art methods.

Abstract

Machine learning problems with multiple objective functions appear either in learning with multiple criteria, where learning has to make a trade-off between multiple performance metrics such as fairness, safety and accuracy; or in multi-task learning, where multiple tasks are optimized jointly, sharing inductive bias between them. These problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased noisy gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it can guarantee convergence without increasing the batch size even in the non-convex setting. Simulations on multi-task supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.
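The gradient bias mentioned in the abstract can be illustrated with a small numerical experiment. The sketch below is not taken from the paper: it assumes two objectives, uses the standard closed-form solution of the two-gradient min-norm subproblem, and picks hypothetical gradients `g1`, `g2` and a hypothetical noise level `sigma`. It averages the min-norm direction computed from noisy gradients and compares it with the direction computed from the exact gradients. Because the weight depends nonlinearly on the noise, the two need not agree even though the noise has zero mean.

```python
import random


def min_norm_weight(g1, g2):
    """Weight w in [0, 1] minimizing ||w*g1 + (1-w)*g2||^2 (two-objective case)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every weight gives the same direction
    w = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, w))  # clip to the simplex


def direction(g1, g2):
    """Min-norm convex combination of the two gradients."""
    w = min_norm_weight(g1, g2)
    return [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]


def mean_noisy_direction(g1, g2, sigma, n_samples, seed=0):
    """Average the direction obtained from gradients with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    total = [0.0] * len(g1)
    for _ in range(n_samples):
        n1 = [a + rng.gauss(0.0, sigma) for a in g1]
        n2 = [b + rng.gauss(0.0, sigma) for b in g2]
        d = direction(n1, n2)
        total = [t + di for t, di in zip(total, d)]
    return [t / n_samples for t in total]


# Hypothetical example values, chosen only for illustration.
g1, g2 = [1.0, 0.0], [0.0, 1.0]
true_dir = direction(g1, g2)  # [0.5, 0.5]
avg_dir = mean_noisy_direction(g1, g2, sigma=0.3, n_samples=20000)
bias = sum((a - b) ** 2 for a, b in zip(avg_dir, true_dir)) ** 0.5
print("true direction:", true_dir)
print("mean noisy direction:", avg_dir)
print("bias norm:", bias)
```

In this example the averaged noisy direction is shrunk toward the origin relative to the exact one, which is the kind of systematic error that a larger batch size would reduce but a single-sample method cannot average away.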

Paper Structure (26 sections, 10 theorems, 115 equations, 5 figures, 9 tables, 2 algorithms)

Introduction
Background
Pareto optimality and Pareto stationarity
Multiple Gradient Descent Algorithm
Stochastic multi-objective gradient and its brittleness
Stochastic Multi-objective Gradient Descent With Correction
A basic algorithmic framework
Generalization to nested MOO setting
A unified convergence result
Convergence with stronger assumptions
Related work
Experiments
Supervised learning
Reinforcement learning
Conclusions
...and 11 more sections

Key Result

Lemma 1

Define $\mathcal{F}_k$ as the $\sigma$-algebra generated by $Y_1, Y_2, \ldots, Y_k$. Consider the sequences generated by (eq:y update), (eq:lambda update), (eq:x update), (eq:z update-new) and (eq:z update2). If we choose $T=1$ and $\eta_1=\beta_K$, under assumptions specified in Appendix app:moco-inexact, we have where $\alpha_k, \beta_k$ are the learning rates in updates (eq:x update) and (eq:y update), respectively,

Figures (5)

Figure 1: A toy example from liu2021conflict with two objectives (Figures \ref{fig:grad-task-1} and \ref{fig:grad-task-2}) that shows the impact of gradient bias. We use the mean objective as a reference when plotting the trajectories for each of the 3 initializations. The starting points of the trajectories are denoted by a black $\bullet$, and the trajectories fade from red (start) to yellow (end). The Pareto front is given by the gray bar, and the black $\star$ denotes the point on the Pareto front corresponding to equal weights for each objective. We implement the recent MOO algorithms SMG liu2021stochastic, PCGrad yu2020gradient, and CAGrad liu2021conflict, as well as MGDA Desideri2012mgda, alongside our method. Except for MGDA (Figure \ref{fig:toy-mgda}), all algorithms only have access to gradients of each objective with added zero-mean Gaussian noise. SMG, CAGrad, and PCGrad fail to find the Pareto front for some initializations.
Figure 2: Comparison of trajectories in the objective space. We use five initializations in the same toy example as in Figure \ref{fig:toy-comp}, and plot the optimization trajectories in the objective space. MGDA converges to the Pareto front from all of the initializations. SMG, PCGrad, and CAGrad, which only have access to a single stochastic gradient per objective, fail to converge to the Pareto front for some initializations. Our MoCo follows a trajectory similar to that of MGDA, and finds the Pareto front for each initialization.
Figure 3: Comparison of multi-gradient error
Figure 4: Training and test loss for the Cityscapes tasks
Figure 5: Training and test loss for the NYU-v2 tasks
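
The noisy-gradient setting of Figures 1 and 2 suggests a simple way to see why a correction step can help. The sketch below is an assumption-laden illustration, not the paper's algorithm: it assumes the correction takes the form of a moving-average tracking variable `y` with step size `beta` (the paper's exact update equations are not reproduced on this page), it keeps the iterate fixed so that the true gradients `g1`, `g2` are constant, and all numerical values are hypothetical. It compares the error of the time-averaged min-norm direction built from raw noisy gradients with the one built from the tracked gradients.

```python
import random


def min_norm_direction(g1, g2):
    """Min-norm point of the segment between g1 and g2 (two-objective case)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    w = 0.5 if denom == 0.0 else sum((b - a) * b for a, b in zip(g1, g2)) / denom
    w = min(1.0, max(0.0, w))  # clip to the simplex
    return [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]


def noisy(g, sigma, rng):
    """Gradient with added zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in g]


def compare_bias(g1, g2, sigma=0.3, beta=0.05, steps=20000, burn_in=1000, seed=0):
    """Return (raw_error, tracked_error) of the averaged directions at a fixed point."""
    rng = random.Random(seed)
    y1, y2 = noisy(g1, sigma, rng), noisy(g2, sigma, rng)  # tracking variables
    sum_raw = [0.0] * len(g1)
    sum_trk = [0.0] * len(g1)
    for k in range(steps):
        h1, h2 = noisy(g1, sigma, rng), noisy(g2, sigma, rng)
        # Assumed tracking update: move y toward the fresh stochastic gradient.
        y1 = [y - beta * (y - h) for y, h in zip(y1, h1)]
        y2 = [y - beta * (y - h) for y, h in zip(y2, h2)]
        if k >= burn_in:
            d_raw = min_norm_direction(h1, h2)
            d_trk = min_norm_direction(y1, y2)
            sum_raw = [s + d for s, d in zip(sum_raw, d_raw)]
            sum_trk = [s + d for s, d in zip(sum_trk, d_trk)]
    n = steps - burn_in
    true_dir = min_norm_direction(g1, g2)

    def err(total):
        return sum((t / n - d) ** 2 for t, d in zip(total, true_dir)) ** 0.5

    return err(sum_raw), err(sum_trk)


# Hypothetical gradients, chosen only for illustration.
raw_error, tracked_error = compare_bias([1.0, 0.0], [0.0, 1.0])
print("error with raw noisy gradients:", raw_error)
print("error with tracked gradients:", tracked_error)
```

Under these assumptions the tracked gradients have much lower variance than a single noisy sample, so the direction computed from them sits closer to the exact min-norm direction without using a larger batch per step.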

Theorems & Definitions (22)

Remark 1: On the Lipschitz continuity of $\lambda^*_\rho(x)$
Lemma 1
Lemma 2
Remark 2: Connection between nested MOO with multi-objective actor-critic
Lemma 3
Theorem 1
Remark 3: Comparison with SMG liu2021stochastic
Theorem 2
Theorem 3
Remark 4: On the stronger assumptions
...and 12 more
