TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning

Seungmin Baek, Soyul Lee, Hayeon Jo, Hyesong Choi, Dongbo Min

TL;DR

TADFormer tackles the efficiency challenge of multi-task learning in vision by introducing task prompts and a Dynamic Task Filter (DTF) within a task-adaptive Transformer framework. The approach combines a parameter-efficient shared TS-Module with a Task-Prompt Conditional (TPC) operator and a TA-Module that uses DTF to generate input-context-aware, task-specific features. Quantitative results on PASCAL-Context show superior accuracy and up to 8.4x fewer trainable parameters than full fine-tuning, with further gains when using larger backbones or alternative decoders. The method also proves compatible with adapter-based PEFT schemes, underscoring its flexibility and practical impact for scalable, multi-task dense prediction.

Abstract

The transfer learning paradigm has driven substantial advances in various vision tasks. However, as state-of-the-art models continue to grow, classical full fine-tuning often becomes computationally impractical, particularly in multi-task learning (MTL) setups, where training complexity increases in proportion to the number of tasks. Consequently, recent studies have explored Parameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some progress, these approaches remain limited in capturing the fine-grained, task-specific features that are crucial to MTL. In this paper, we introduce the Task-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework that performs task-aware feature adaptation in a fine-grained manner by dynamically considering task-specific input contexts. TADFormer proposes parameter-efficient prompting for task adaptation and the Dynamic Task Filter (DTF) to capture task information conditioned on input contexts. Experiments on the PASCAL-Context benchmark demonstrate that the proposed method achieves higher accuracy in dense scene understanding tasks while reducing the number of trainable parameters by up to 8.4 times compared to full fine-tuning of MTL models. TADFormer also demonstrates superior parameter efficiency and accuracy compared to recent PEFT methods.

Paper Structure

This paper contains 25 sections, 7 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Comparison of Grad-CAM (Selvaraju et al., 2017) visualizations for MTLoRA (Agiza et al., 2024) and TADFormer. From top to bottom: input images, MTLoRA, and TADFormer. TADFormer extracts fine-grained features that capture the input contexts more precisely, thanks to DTF.
  • Figure 2: Overview of the proposed TADFormer. (a) The encoder takes image patch tokens with task prompts prepended as input. We adopt the VPT-shallow approach (Jia et al., 2022), inserting the task prompts only into the first Transformer stage, and use Swin Transformer (Liu et al., 2021) as the encoder backbone. In all blocks except the last one of each Transformer stage, task-agnostic features are extracted through the task-shared module (TS-Module). The task-adapting Transformer block extracts fine-grained task-specific features through the task-prompt conditional (TPC) operator and the task-aware module (TA-Module). (b) The TPC operator generates task-adapted features using the task attention map between task prompts and image patch tokens. These features are then fed into the TA-Module, which consists of the dynamic task filter (DTF) and down-up projections, to account for the input contexts that are crucial to MTL.
  • Figure 3: DTF architecture: The down-projected features of channel dimension $r$ are used to generate channel-wise convolution parameters in the parameter generation network. GAP denotes global average pooling.
  • Figure 4: Patch merging module of TADFormer: Leveraging the benefits of task-prompt tuning, we fine-tune only the prompt upsampling operation that doubles the channel size of the task prompts, while freezing the image patch merging module.
  • Figure 5: Visualization of the task attention map (TAM) used by the TPC operator. Note that the task prompts for each task focus on different parts of the image.
  • ...and 5 more figures
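
The DTF description in Figure 3 (global average pooling over the down-projected features, followed by a parameter generation network that emits channel-wise convolution parameters) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes, for illustration only, that the parameter generation network is a single linear layer, that the generated kernels are k x k and applied depthwise to the same down-projected features, and that zero padding is used. All function names (`global_average_pool`, `generate_kernels`, `depthwise_conv`, `dynamic_task_filter`) and the `weight` layout are hypothetical.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [channel][row][col]


def global_average_pool(x: Tensor3) -> List[float]:
    """GAP: average each channel over all of its spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def generate_kernels(context: List[float], weight: List[List[float]], k: int) -> Tensor3:
    """Hypothetical parameter generator (assumption: one linear layer).

    Maps the pooled context of length r to r kernels of size k x k.
    `weight` has r * k * k rows, each of length r.
    """
    r = len(context)
    flat = [sum(w * c for w, c in zip(row, context)) for row in weight]
    assert len(flat) == r * k * k, "weight must have r * k * k rows"
    kernels = []
    for ch in range(r):
        base = ch * k * k
        kernels.append([flat[base + i * k: base + (i + 1) * k] for i in range(k)])
    return kernels


def depthwise_conv(x: Tensor3, kernels: Tensor3) -> Tensor3:
    """Apply each channel's own k x k kernel (stride 1, zero padding)."""
    k = len(kernels[0])
    pad = k // 2
    out = []
    for ch, kern in zip(x, kernels):
        h, w = len(ch), len(ch[0])
        out_ch = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, z = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= z < w:  # skip the padded border
                            acc += kern[di][dj] * ch[y][z]
                out_ch[i][j] = acc
        out.append(out_ch)
    return out


def dynamic_task_filter(x: Tensor3, weight: List[List[float]], k: int = 3) -> Tensor3:
    """Sketch of DTF: the kernels depend on the input through its pooled context."""
    context = global_average_pool(x)
    kernels = generate_kernels(context, weight, k)
    return depthwise_conv(x, kernels)


# Toy usage: r = 2 channels on a 3 x 3 map, with illustrative generator weights.
toy_x = [[[1.0] * 3 for _ in range(3)], [[2.0] * 3 for _ in range(3)]]
toy_weight = [[0.0, 0.0] for _ in range(2 * 3 * 3)]
toy_weight[4] = [1.0, 0.0]   # centre tap of channel 0's kernel reads channel 0's mean
toy_weight[13] = [0.0, 1.0]  # centre tap of channel 1's kernel reads channel 1's mean
toy_out = dynamic_task_filter(toy_x, toy_weight, k=3)
```

In this toy setting each generated kernel has a single non-zero centre tap equal to that channel's pooled mean, so the filter scales every channel by its own average. The point of the sketch is only that the convolution parameters are computed from the input rather than stored as fixed weights, which is what makes the filter input-context-aware.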