Selective Task Group Updates for Multi-Task Optimization

Wooseong Jeong, Kuk-Jin Yoon

TL;DR

This work tackles negative transfer in multi-task learning by looking beyond the balance of shared parameters and introducing selective task group updates guided by proximal inter-task affinity. Tasks are partitioned into groups that evolve during training, and the groups are updated sequentially for each batch so that task-specific parameters are learned more effectively. The authors give a theoretical analysis showing convergence to Pareto-stationary points under standard Lipschitz-gradient assumptions, and validate the approach on NYUD-v2, PASCAL-Context, and Taskonomy, where it outperforms loss- and gradient-based baselines. The method is reported to scale well, with a favorable per-batch computational cost and robust performance across backbones, groupings, and batch sizes.

Abstract

Multi-task learning enables the acquisition of task-generic knowledge by training multiple tasks within a unified architecture. However, training all tasks together in a single architecture can lead to performance degradation, known as negative transfer, which is a main concern in multi-task learning. Previous works have addressed this issue by optimizing the multi-task network through gradient manipulation or weighted loss adjustments. However, their optimization strategy focuses on addressing task imbalance in shared parameters, neglecting the learning of task-specific parameters. As a result, they show limitations in mitigating negative transfer, since the learning of shared space and task-specific information influences each other during optimization. To address this, we propose a different approach to enhance multi-task performance by selectively grouping tasks and updating them for each batch during optimization. We introduce an algorithm that adaptively determines how to effectively group tasks and update them during the learning process. To track inter-task relations and optimize multi-task networks simultaneously, we propose proximal inter-task affinity, which can be measured during the optimization process. We provide a theoretical analysis on how dividing tasks into multiple groups and updating them sequentially significantly affects multi-task performance by enhancing the learning of task-specific parameters. Our methods substantially outperform previous multi-task optimization approaches and are scalable to different architectures and various numbers of tasks.
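To make the idea of updating task groups sequentially for each batch concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation: the scalar model (`shared + head[k]`), the squared-error losses, the task names, and the fixed grouping are all assumptions chosen for illustration, and the adaptive, affinity-driven grouping described in the abstract is not modeled.

```python
# Toy sketch (not the authors' implementation): sequential task-group updates.
# Assumed model: task k predicts shared + heads[k];
# loss_k = (shared + heads[k] - targets[k]) ** 2.

def loss(shared, head, target):
    return (shared + head - target) ** 2

def grad(shared, head, target):
    # For this toy model, d(loss)/d(shared) and d(loss)/d(head) coincide.
    return 2.0 * (shared + head - target)

def sequential_group_step(shared, heads, targets, groups, lr):
    """Apply one update per task group, in order, for a single batch."""
    heads = dict(heads)
    for group in groups:
        # Gradients are recomputed at the parameters left by the previous group.
        g = {k: grad(shared, heads[k], targets[k]) for k in group}
        shared -= lr * sum(g.values())   # shared parameter: summed group gradient
        for k in group:
            heads[k] -= lr * g[k]        # task-specific parameters of this group only
    return shared, heads

targets = {"seg": 1.0, "depth": -1.0, "normal": 0.5}   # hypothetical tasks
heads = {k: 0.0 for k in targets}
shared = 0.0
groups = [["seg", "normal"], ["depth"]]                 # hypothetical fixed grouping

initial_total = sum(loss(shared, heads[k], targets[k]) for k in targets)
for _ in range(200):
    shared, heads = sequential_group_step(shared, heads, targets, groups, lr=0.05)
total_loss = sum(loss(shared, heads[k], targets[k]) for k in targets)
```

The point of the sketch is the loop structure: each group sees the shared parameter as already modified by the groups before it, and only the task-specific parameters of the current group are touched in that step.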

Paper Structure

This paper contains 23 sections, 10 theorems, 49 equations, 11 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

Let $g_k$ denote the task-specific gradients backpropagated from the loss function $\mathcal{L}_k$ with respect to the parameters $\Theta_s^t$. At a given time step $t$, if the inter-task affinity from task group $\{i, k\}$ to task $k$ is greater than or equal to the inter-task affinity from group $

Figures (11)

  • Figure 1: Comparison of multi-task optimization methods. $\Theta$ represents the network parameters, and $\{\mathcal{L}\}_{i=1}^{\mathcal{K}}$ denotes the task-specific losses for $\mathcal{K}$ tasks. (a) Loss-based approaches balance the loss by adjusting the weights $\{w_i\}_{i=1}^{\mathcal{K}}$ during optimization. (b) Gradient-based approaches modify the task-specific gradients $\{g_i\}_{i=1}^{\mathcal{K}}$ with respect to $\Theta$. (c) Our method divides the tasks into $\mathcal{M}$ groups (in this case, $\mathcal{M}=2$) and updates them sequentially for each batch during optimization.
  • Figure 2: Comparison of the average time required by each optimization process to handle a single batch for 5 tasks on PASCAL-Context (left) and 11 tasks on Taskonomy (right).
  • Figure 3: The averaged grouping results $\{G\}_{i=1}^{\mathcal{M}}$ on the Taskonomy benchmark are shown for ViT-L in (a) and for ViT-T in (b). (c) illustrates how the number of task groups, $\mathcal{M}$, changes during optimization. (d) shows the change in proximal inter-task affinity from DE to C.
  • Figure 4: The averaged grouping results $\{G\}_{i=1}^{\mathcal{M}}$ are shown for NYUD-v2 in (a) and PASCAL-Context in (b). (c) illustrates how the decaying factor $\beta$ influences the stable tracking of proximal inter-task affinity.
  • Figure : Tracking Proximal Inter-Task Affinity for Task Group Updates
  • ...and 6 more figures

Theorems & Definitions (18)

  • Definition 1: Inter-Task Affinity
  • Definition 2: Proximal Inter-Task Affinity
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4: Convergence Analysis
  • Theorem 5
  • Definition 3: Lipschitz continuity
  • Theorem 5
  • proof
  • ...and 8 more
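
The definitions listed above are given by title only, so the following is a sketch under stated assumptions rather than the paper's formulation. It assumes the lookahead loss-ratio form of inter-task affinity used in prior task-grouping work: take a trial step on the shared parameters with task `src`'s gradient and measure the relative change in task `dst`'s loss. The toy model, task names, and the function `lookahead_affinity` are hypothetical; the proximal variant is not reproduced here.

```python
# Illustrative sketch only: lookahead loss-ratio affinity on a toy model.
# Assumed model: task k predicts shared + heads[k]; squared-error loss.

def loss(shared, head, target):
    return (shared + head - target) ** 2

def shared_grad(shared, head, target):
    # Gradient of the task loss with respect to the shared parameter.
    return 2.0 * (shared + head - target)

def lookahead_affinity(shared, heads, targets, src, dst, lr):
    """Assumed form: 1 - L_dst(after a trial step with src's gradient) / L_dst(before)."""
    before = loss(shared, heads[dst], targets[dst])
    stepped = shared - lr * shared_grad(shared, heads[src], targets[src])
    after = loss(stepped, heads[dst], targets[dst])
    return 1.0 - after / before

targets = {"seg": 1.0, "depth": -1.0, "normal": 0.5}   # hypothetical tasks
heads = {k: 0.0 for k in targets}
shared = 0.0

# Positive when src's step lowers dst's loss, negative when it raises it.
aff_seg_to_normal = lookahead_affinity(shared, heads, targets, "seg", "normal", lr=0.1)
aff_seg_to_depth = lookahead_affinity(shared, heads, targets, "seg", "depth", lr=0.1)
```

In this toy setup the targets for `seg` and `normal` pull the shared parameter in the same direction, so the affinity between them is positive, while `depth` pulls the opposite way and receives a negative affinity from `seg`. A grouping rule could use such signs to decide which tasks to update together, though the rule the paper actually uses is not specified in this summary.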