Dual-Balancing for Multi-Task Learning

Baijiong Lin, Weisen Jiang, Feiyang Ye, Yu Zhang, Pengguang Chen, Ying-Cong Chen, Shu Liu, Ivor W. Tsang, James T. Kwok

TL;DR

DB-MTL addresses task imbalance in multi-task learning (MTL) by balancing both loss scales and gradient magnitudes. It applies a parameter-free logarithm transformation to put task losses on a comparable scale, and a maximum-norm gradient normalization, computed on exponential moving average (EMA) estimates of the task gradients, to equalize update magnitudes across tasks. Across scene understanding, molecular property prediction, and image classification benchmarks, DB-MTL consistently outperforms state-of-the-art baselines and in many cases matches or approaches single-task learning (STL) on the harder tasks. It can also be combined with other gradient-balancing methods. The approach improves training stability and reduces gradient conflicts; directions for future work include accounting for gradient variance and a theoretical convergence analysis.

Abstract

Multi-task learning aims to learn multiple related tasks simultaneously and has achieved great success in various fields. However, the disparity in loss and gradient scales among tasks often leads to performance compromises, and the balancing of tasks remains a significant challenge. In this paper, we propose Dual-Balancing Multi-Task Learning (DB-MTL) to achieve task balancing from both the loss and gradient perspectives. Specifically, DB-MTL achieves loss-scale balancing by performing logarithm transformation on each task loss, and rescales gradient magnitudes by normalizing all task gradients to comparable magnitudes using the maximum gradient norm. Extensive experiments on a number of benchmark datasets demonstrate that DB-MTL consistently performs better than the current state-of-the-art.
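
The two balancing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `db_mtl_step`, the constant forgetting rate `beta`, the stabilizer `eps`, and the choice to rescale every task gradient to exactly the maximum norm are assumptions made here for illustration. The sketch relies on the identity $\nabla \log \ell_t = \nabla \ell_t / \ell_t$, so the log transformation amounts to dividing each task gradient by its (positive) loss value.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(c * c for c in vec))


def db_mtl_step(losses, grads, ema_grads, beta=0.9, eps=1e-8):
    """One hypothetical dual-balancing aggregation step.

    losses    : positive loss value per task
    grads     : per-task gradient of the raw loss w.r.t. shared parameters
    ema_grads : running EMA of the log-loss gradients, one list per task
    beta      : EMA forgetting rate (assumed constant here)
    Returns (aggregated_update, new_ema_grads).
    """
    new_ema = []
    for loss, grad, ema in zip(losses, grads, ema_grads):
        # Loss-scale balancing: gradient of log(loss) is grad(loss) / loss.
        log_grad = [c / (loss + eps) for c in grad]
        # Smooth the noisy per-batch gradient with an exponential moving average.
        new_ema.append([beta * e + (1.0 - beta) * c for e, c in zip(ema, log_grad)])

    # Gradient-magnitude balancing: rescale each task gradient to the max norm.
    norms = [l2_norm(e) for e in new_ema]
    alpha = max(norms)
    update = [0.0] * len(grads[0])
    for ema, norm in zip(new_ema, norms):
        scale = alpha / (norm + eps)
        for i, c in enumerate(ema):
            update[i] += scale * c
    return update, new_ema


if __name__ == "__main__":
    # Two tasks whose raw losses differ by four orders of magnitude.
    losses = [100.0, 0.01]
    grads = [[300.0, 0.0], [0.0, 0.01]]
    ema = [[0.0, 0.0], [0.0, 0.0]]
    update, ema = db_mtl_step(losses, grads, ema, beta=0.0)
    print(update)
```

With `beta=0.0` the EMA reduces to the current log-loss gradient, which makes the effect easy to inspect: both tasks contribute to the update with the same magnitude even though their raw losses and gradients differ by orders of magnitude.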

Paper Structure

This paper contains 37 sections, 1 theorem, 4 equations, 10 figures, 6 tables, 1 algorithm.

Key Result

Proposition 3.1

For $x>0$, $\log(x) = \min_{s \in \mathbb{R}} \left( e^{s} x - s - 1 \right)$.
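
Proposition 3.1 can be verified directly: the objective $f(s) = e^{s} x - s - 1$ is convex in $s$, its derivative $e^{s} x - 1$ vanishes at $s^{*} = -\log(x)$, and substituting gives $f(s^{*}) = 1 + \log(x) - 1 = \log(x)$. The short script below is a numerical sanity check of that identity, not part of the paper; the helper names `objective` and `minimize_over_s`, the search interval, and the iteration count are assumptions chosen for illustration.

```python
import math


def objective(s, x):
    """The function minimized over s in Proposition 3.1."""
    return math.exp(s) * x - s - 1.0


def minimize_over_s(x, lo=-50.0, hi=50.0, iters=200):
    """Ternary search for the minimizer of a convex function of s."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1, x) < objective(m2, x):
            hi = m2  # the minimum lies to the left of m2
        else:
            lo = m1  # the minimum lies to the right of m1
    s_star = 0.5 * (lo + hi)
    return s_star, objective(s_star, x)


if __name__ == "__main__":
    for x in (0.01, 0.5, 1.0, 7.0, 100.0):
        s_star, value = minimize_over_s(x)
        # Compare the numerical minimum with log(x) and the minimizer with -log(x).
        print(x, value, math.log(x), s_star, -math.log(x))
```

The search interval only needs to contain $-\log(x)$, so the default bounds are adequate for the values of $x$ tried here.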

Figures (10)

  • Figure 1: Performance of existing gradient balancing methods with the loss-scale balancing method (i.e., logarithm transformation) on NYUv2. "vanilla" stands for the original method.
  • Figure 2: Comparison of IMTL-L (Liu et al., 2021) and the loss-scale balancing method on four datasets.
  • Figure 3: Comparison of GradNorm (Chen et al., 2018) and the gradient-magnitude balancing method on four datasets.
  • Figure 4: Performance on NYUv2 for the Cross-stitch (Misra et al., 2016) and MTAN (Liu et al., 2019) architectures.
  • Figure 5: Effect of the EMA forgetting rate $\beta$ in Eq. \ref{eq:beta} on the Office-31 dataset. $k$ denotes the number of iterations.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 3.1
  • proof