Robust Multi-Task Learning with Excess Risks

Yifei He; Shiji Zhou; Guojun Zhang; Hyokun Yun; Yi Xu; Belinda Zeng; Trishul Chilimbi; Han Zhao

TL;DR

This work addresses robustness of multi-task learning to label noise, where conventional loss-based weighting overemphasizes corrupted tasks. It introduces ExcessMTL, a task-balancing method that uses excess risks to measure distance to convergence and updates task weights via online exponentiated gradient within a min–max framework. Excess risks are efficiently estimated with a Taylor-based expansion and a diagonal empirical Fisher approximation, and the algorithm includes scale normalization to ensure cross-task comparability; theoretical results show $O(1/\sqrt{t})$ convergence with Pareto optimality in convex settings and Pareto stationarity in non-convex settings. Empirical evaluations on MultiMNIST, Office-Home, and NYUv2 demonstrate superior robustness to label noise, preserving performance on clean tasks while downweighting noisy tasks, outperforming baselines such as MGDA, GroupDRO, GradNorm, IMTL, and MOML. Overall, ExcessMTL provides a principled, scalable approach to robust multi-task balancing with implications for other loss-weighting scenarios beyond MTL.
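The weight update described above can be sketched as a multiplicative-weights step on the probability simplex. This is a minimal illustration, not the authors' implementation: the function names, the step size, and the choice to normalize each task's excess risk by its initial value are assumptions made here for the example, and all numbers are made up.

```python
import math


def normalize_by_initial(excess_risks, initial_excess_risks, eps=1e-12):
    """Hypothetical scale normalization: express each excess risk relative to
    the task's initial value so tasks with different loss scales are comparable."""
    return [r / (r0 + eps) for r, r0 in zip(excess_risks, initial_excess_risks)]


def exponentiated_gradient_step(weights, scores, step_size=0.5):
    """One exponentiated gradient update.

    Each weight is multiplied by exp(step_size * score) and the result is
    renormalized, so tasks with larger scores gain weight while the weights
    stay non-negative and sum to one.
    """
    scaled = [w * math.exp(step_size * s) for w, s in zip(weights, scores)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Illustrative values only (not taken from the paper).
weights = [0.5, 0.5]
initial = [2.0, 0.4]   # assumed excess-risk estimates at the start of training
current = [0.2, 0.3]   # task 0 has closed most of its gap, task 1 has not
relative = normalize_by_initial(current, initial)
weights = exponentiated_gradient_step(weights, relative)
```

In this toy run task 1 ends up with the larger weight, because relative to its starting point it is further from convergence, even though the raw values of the two tasks are close.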

Abstract

Multi-task learning (MTL) considers learning a joint model for multiple tasks by optimizing a convex combination of all task losses. To solve the optimization problem, existing methods use an adaptive weight updating scheme, where task weights are dynamically adjusted based on their respective losses to prioritize difficult tasks. However, these algorithms face a great challenge whenever label noise is present, in which case excessive weights tend to be assigned to noisy tasks that have relatively large Bayes optimal errors, thereby overshadowing other tasks and causing performance to drop across the board. To overcome this limitation, we propose Multi-Task Learning with Excess Risks (ExcessMTL), an excess risk-based task balancing method that updates the task weights by their distances to convergence instead. Intuitively, ExcessMTL assigns higher weights to worse-trained tasks that are further from convergence. To estimate the excess risks, we develop an efficient and accurate method with Taylor approximation. Theoretically, we show that our proposed algorithm achieves convergence guarantees and Pareto stationarity. Empirically, we evaluate our algorithm on various MTL benchmarks and demonstrate its superior performance over existing methods in the presence of label noise. Our code is available at https://github.com/yifei-he/ExcessMTL.
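One way to picture the Taylor-based estimate is the following sketch. It assumes the excess risk of a task is approximated by a quadratic form in the task gradient, `g^T H^{-1} g`, with the curvature `H` replaced by a diagonal built from accumulated squared gradients (an AdaGrad-style stand-in for the diagonal empirical Fisher). The class name, the exact form of the diagonal, and the `eps` constant are assumptions for illustration and may differ from the released code.

```python
import math


class DiagonalExcessRiskEstimator:
    """Tracks per-parameter squared gradients for one task (hypothetical helper)."""

    def __init__(self, num_params, eps=1e-8):
        self.sum_sq = [0.0] * num_params  # running sum of squared gradients
        self.eps = eps                    # avoids division by zero

    def update(self, grad):
        """Accumulate the new gradient and return g^T diag(sqrt(sum g^2))^{-1} g."""
        estimate = 0.0
        for k, g in enumerate(grad):
            self.sum_sq[k] += g * g
            estimate += g * g / (math.sqrt(self.sum_sq[k]) + self.eps)
        return estimate


estimator = DiagonalExcessRiskEstimator(num_params=2)
early = estimator.update([3.0, 4.0])   # large gradient: far from convergence
late = estimator.update([0.3, 0.4])    # small gradient: close to convergence
```

The estimate shrinks as the gradient shrinks, which matches the intuition that a task near its optimum has little excess risk left. The estimator needs only gradients the training loop already computes, so no second-order derivatives are required.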

Paper Structure (21 sections, 11 theorems, 59 equations, 9 figures, 1 algorithm)

Introduction
Preliminaries
Excess Risks
Multi-Task Learning
Multi-Task Learning with Excess Risks
Motivations and Objectives
Algorithm
Conceptual Comparison
Theoretical Analysis
Experiments
Datasets
Noise Injection Scheme
Empirical Analysis and Comparison
Benchmark Evaluation
Related Work
...and 6 more sections

Key Result

Theorem 3.1

Suppose (i) each task-specific loss $\ell_i$ is $L$-Lipschitz, (ii) $\ell_i$ is convex in the model parameter $\theta$, (iii) $\ell_i$ is bounded by $B_\ell$, and (iv) $\|\theta\|_2$ is bounded by $B_\theta$. At training step $t$, let $\bar{\theta}^{(1:t)} \coloneqq \frac{1}{t}\sum_{\tau=1}^{t}\theta^{(\tau)}$ [...] where $m$ is the number of tasks.

Figures (9)

Figure 1: Conceptual comparison between ExcessMTL and loss weighting methods. The figure shows a two-task MTL setting, where Task 1 contains label noise, while Task 2 does not. Thus, the Bayes optimal loss (dashed line) for Task 1 is non-zero. The curve represents the Pareto front, i.e., all points on the curve are Pareto optimal. Loss weighting methods aim to find the solution with equal losses for two tasks, severely sacrificing the performance of Task 2. On the other hand, ExcessMTL finds the solution with equal excess risks, striking a better balance between the two tasks.
Figure 2: Excess risks on MultiMNIST with noise level 0.6. The estimated excess risk closely matches the ground-truth pattern.
Figure 3: Weight and accuracy on the MultiMNIST dataset with a noise level of 0.8. ExcessMTL assigns most weight to the clean task so that the performance is least affected by the injected noise.
Figure 4: MultiMNIST loss profile (lower left better). The left plot has no noise injected, while in the right one, task 2 has 80% noise. With no noise injected, all algorithms achieve ideal performance. However, with significant noise injected, only ExcessMTL retains performance close to Bayes optimal on both tasks.
Figure 5: Comparison with MOML and MGDA. MGDA and MOML use the same method to select weights, on the training and validation sets respectively. Although its weight assignment is more consistent than MGDA's, MOML fails when the noise level is high, showing that a clean validation set does not alleviate the label noise issue. ExcessMTL outperforms both baselines.
...and 4 more figures
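
The contrast drawn in Figure 1 can be illustrated with a toy calculation. The numbers below are invented for the example, and the Bayes optimal losses are assumed to be known, which is not the case in practice (the method estimates excess risks instead). The helper name is hypothetical.

```python
def pick_priority_task(scores):
    """Return the index of the task with the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Two-task setting: task 0 has label noise, so its Bayes optimal loss is non-zero.
losses = [0.60, 0.30]        # current training losses (illustrative)
bayes_losses = [0.50, 0.00]  # irreducible losses (assumed known for this toy case)

# Excess risk = current loss minus the best achievable loss.
excess_risks = [l - b for l, b in zip(losses, bayes_losses)]

by_loss = pick_priority_task(losses)          # loss weighting favors the noisy task
by_excess = pick_priority_task(excess_risks)  # excess-risk weighting favors the clean task
```

Loss-based weighting prioritizes task 0 because its loss is larger, even though almost all of that loss is irreducible. Ranking by excess risk prioritizes task 1, which still has room to improve.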

Theorems & Definitions (20)

Definition 2.1: Pareto dominance
Definition 2.2: Pareto optimal
Definition 2.3: Pareto stationary
Theorem 3.1: Convergence
Corollary 3.1: Pareto Optimality
Theorem 3.2: Pareto Stationarity
Theorem 1.1: Convergence
proof
Theorem 1.4: Nemirovski et al. (2009)
Corollary 1.4: Pareto Optimality
...and 10 more
