AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning

Enneng Yang; Junwei Pan; Ximei Wang; Haibin Yu; Li Shen; Xihua Chen; Lei Xiao; Jie Jiang; Guibing Guo

TL;DR

AdaTask tackles task dominance in multi-task learning by introducing per-task accumulative gradients and an AU/rAU-based metric to quantify parameter-wise task influence. By separating accumulative gradients for each task within adaptive optimizers (e.g., AdaGrad, RMSProp, Adam), AdaTask prevents any single task from dominating learning rates or updates, leading to improved performance on dominated tasks while preserving strong overall metrics. Extensive experiments across CityScapes, TikTok, and WeChat demonstrate substantial gains on dominated tasks and competitive or state-of-the-art average task performance, with additional gains achievable when combined with gradient-direction methods. The work also provides analysis showing AdaTask balances shared parameters across layers and reduces learning-rate dominance, offering a practical, architecture-agnostic approach to robust multi-task optimization.

Abstract

Multi-task learning (MTL) models have demonstrated impressive results in computer vision, natural language processing, and recommender systems. Although many approaches have been proposed, how well they balance different tasks on each parameter remains unclear. In this paper, we propose to measure the task dominance degree of a parameter by the total updates of each task on this parameter. Specifically, we compute the total updates as the exponentially decaying Average of the squared Updates (AU) on a parameter from the corresponding task. Based on this novel metric, we observe that many parameters in existing MTL methods, especially those in the higher shared layers, are still dominated by one or several tasks. The dominance of AU is mainly due to the dominance of accumulative gradients from one or several tasks. Motivated by this, we propose a Task-wise Adaptive learning rate approach, AdaTask for short, to separate the accumulative gradients, and hence the learning rate, of each task for each parameter in adaptive learning rate approaches (e.g., AdaGrad, RMSProp, and Adam). Comprehensive experiments on computer vision and recommender system MTL datasets demonstrate that AdaTask significantly improves the performance of dominated tasks, resulting in state-of-the-art (SOTA) average task-wise performance. Analysis on both synthetic and real-world datasets shows that AdaTask balances parameters well in every shared layer.
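
The idea of separating accumulative gradients per task can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes an RMSProp-style accumulator, a shared parameter vector, and hypothetical function names (`rmsprop_shared_step`, `rmsprop_taskwise_step`) and hyperparameters (`lr`, `gamma`, `eps`) chosen only for illustration. The first function keeps one accumulator for the summed gradient, so a task with large gradients also sets the learning rate for the others; the second keeps one accumulator per task, so each task's gradient is scaled by its own history.

```python
import numpy as np

def rmsprop_shared_step(theta, task_grads, state, lr=0.01, gamma=0.9, eps=1e-8):
    """Baseline: a single accumulator over the summed gradient of all tasks."""
    g = np.sum(task_grads, axis=0)                 # combined gradient, shape (P,)
    state = gamma * state + (1 - gamma) * g ** 2   # shared accumulative gradient
    theta = theta - lr * g / (np.sqrt(state) + eps)
    return theta, state

def rmsprop_taskwise_step(theta, task_grads, states, lr=0.01, gamma=0.9, eps=1e-8):
    """Task-wise variant (assumed form): one accumulator per task.

    task_grads and states both have shape (num_tasks, P).
    """
    task_grads = np.asarray(task_grads, dtype=float)
    states = gamma * states + (1 - gamma) * task_grads ** 2
    # Each task's gradient is normalised by its own accumulator before summing.
    update = np.sum(task_grads / (np.sqrt(states) + eps), axis=0)
    theta = theta - lr * update
    return theta, states

# Toy case: task A's gradient is 100x larger than task B's on one parameter.
grads = np.array([[100.0], [1.0]])
theta_shared, _ = rmsprop_shared_step(np.zeros(1), grads, np.zeros(1))
theta_task, _ = rmsprop_taskwise_step(np.zeros(1), grads, np.zeros((2, 1)))
print(theta_shared, theta_task)
```

In this toy case the shared accumulator is almost entirely determined by task A, so task B contributes roughly 1% of the step. With per-task accumulators both tasks contribute steps of equal magnitude, which is the balancing effect the abstract describes.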

Paper Structure (30 sections, 10 equations, 8 figures, 11 tables, 4 algorithms)

Introduction
Related Work
Rethinking Task Dominance in MTL
Synthetic Dataset Setting
(RQ1) How can we quantify the task dominance of parameters in MTL model?
(RQ2) To what extent do existing MTL approaches tackle the task dominance issue?
(RQ3) How does task dominance impact the training of MTL models?
Our Proposed Method: AdaTask
Experiments
Performance Evaluation
Overall Results
Study on Task Dominance
Conclusion and Future Works
Acknowledgement
Appendix A: Efficient AdaTask
...and 15 more sections

Figures (8)

Figure 1: Illustration of $\text{rAU}(i,T,B)$ as a metric to measure the task dominance for shared parameters in MTL.
Figure 2: $\text{rAU}(i,T,B)$ of all shared parameters on the synthetic dataset for five MTL models: (a) EqualWeight (PCGrad is close to EqualWeight and was omitted due to page limitations), (b) UW, (c) CAGrad, (d) GradNorm, and (e) our AdaTask. The green area denotes the percentage of parameters dominated by task $A$, the red area denotes the percentage of parameters dominated by task $B$, and the yellow area denotes the percentage of balanced parameters.
Figure 3: $\text{rAU}(i,T,B)$ of shared parameters on the synthetic dataset for EqualWeight (RMSprop), AdaTask, and LAdaTask methods.
Figure 4: $\text{rAU}(i,T,B)$ of shared parameters on the CityScapes dataset for EqualWeight, GradNorm, and AdaTask methods.
Figure 5: $\text{rAU}(i,T,B)$ of shared parameters (MLP layer) on the TikTok dataset for EqualWeight, GradNorm, and AdaTask methods.
...and 3 more figures
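
The $\text{rAU}(i,T,B)$ quantity plotted in these figures can be sketched as follows. The abstract defines AU as an exponentially decaying average of the squared updates a task applies to a parameter. The exact form of rAU is not given on this page, so the sketch assumes it is task $B$'s share of the total AU across tasks. The names `update_au` and `relative_au` and the decay `beta` are hypothetical.

```python
import numpy as np

def update_au(au, task_updates, beta=0.9):
    """Decaying average of squared per-task updates.

    au and task_updates have shape (num_tasks, num_params); task_updates[k, i]
    is the change task k applied to parameter i at the current step.
    """
    return beta * au + (1 - beta) * np.square(task_updates)

def relative_au(au, task_index, eps=1e-12):
    """Assumed rAU: one task's share of the total AU on each parameter."""
    return au[task_index] / (au.sum(axis=0) + eps)

# Two tasks (A = 0, B = 1), three shared parameters.
au = np.zeros((2, 3))
step_updates = np.array([[1.0, 1.0, 0.1],    # updates from task A
                         [1.0, 0.1, 1.0]])   # updates from task B
au = update_au(au, step_updates)
print(relative_au(au, task_index=1))
```

Under this assumed definition, a value near 0.5 indicates a balanced parameter, a value near 0 indicates dominance by task $A$, and a value near 1 indicates dominance by task $B$, which matches the colour coding described for Figure 2.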
