MGDA Converges under Generalized Smoothness, Provably
Qi Zhang, Peiyao Xiao, Shaofeng Zou, Kaiyi Ji
TL;DR
This work addresses multi-objective optimization where the objective losses satisfy generalized $\ell$-smoothness, a relaxation that captures neural network training dynamics beyond standard $L$-smoothness. It provides a comprehensive convergence theory for MGDA and its stochastic variant under generalized smoothness, covering both average and iteration-wise CA distances. The authors introduce a warm-start strategy and an efficient MGDA-FA variant, and prove that MGDA reaches an $\epsilon$-accurate Pareto stationary point using $O(\epsilon^{-2})$ (deterministic) or $O(\epsilon^{-4})$ (stochastic) samples when the CA distance is controlled on average. A tighter per-iteration CA guarantee is also obtained at the cost of higher sample complexity ($O(\epsilon^{-11})$ deterministic, $O(\epsilon^{-17})$ stochastic). Empirical results on Cityscapes and NYU-v2 validate improved gradient conflict handling and show MGDA-FA’s speed advantage, supporting practical applicability to real-world multi-task settings and informing extensions to related MOO algorithms under generalized smoothness.
Abstract
Multi-objective optimization (MOO) is receiving increasing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but they rely on the standard $L$-smoothness or bounded-gradient assumptions, which typically do not hold for neural networks such as long short-term memory (LSTM) models and Transformers. In this paper, we study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We revisit and analyze the fundamental multiple gradient descent algorithm (MGDA) and its stochastic version with double sampling for solving generalized $\ell$-smooth MOO problems; both algorithms approximate the conflict-avoidant (CA) direction, which maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of these algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where a total of $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed in the deterministic and stochastic settings, respectively. We prove that they can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we analyze an efficient variant of MGDA named MGDA-FA that uses only $\mathcal{O}(1)$ time and space while achieving the same performance guarantee as MGDA.
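To make the notion of a common descent direction concrete, the following is a minimal sketch of the classic MGDA update for two objectives, where the minimum-norm convex combination of the two gradients has a well-known closed form. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: it omits the warm start, the double sampling, and the MGDA-FA approximation, and the toy quadratic losses are ordinary $L$-smooth functions chosen only so the result is easy to check. The names `mgda_direction_two` and `run_mgda` are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def mgda_direction_two(g1, g2):
    """Minimum-norm point of the segment between g1 and g2.

    Solves min_{w in [0, 1]} ||w * g1 + (1 - w) * g2||^2 in closed form.
    Returns the weight w and the combined direction d.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any weight gives the same direction.
        w = 0.5
    else:
        w = dot([b - a for a, b in zip(g1, g2)], g2) / denom
        w = min(1.0, max(0.0, w))  # project onto [0, 1]
    d = [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
    return w, d


def run_mgda(x0, grad_f1, grad_f2, step_size=0.1, num_steps=200):
    """Plain deterministic MGDA loop: step along the negative direction."""
    x = list(x0)
    for _ in range(num_steps):
        _, d = mgda_direction_two(grad_f1(x), grad_f2(x))
        x = [xi - step_size * di for xi, di in zip(x, d)]
    return x


# Toy problem (assumption): f1(x) = 0.5 * ||x - a||^2, f2(x) = 0.5 * ||x - b||^2.
a = [1.0, 0.0]
b = [-1.0, 0.0]
grad_f1 = lambda x: [xi - ai for xi, ai in zip(x, a)]
grad_f2 = lambda x: [xi - bi for xi, bi in zip(x, b)]

x_final = run_mgda([0.0, 2.0], grad_f1, grad_f2)
```

For this toy problem the Pareto set is the segment between `a` and `b`, and the iterates approach it: at `x_final` the two gradients nearly cancel, so the combined direction is close to zero, which is the Pareto stationarity condition the convergence results refer to.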
