Task Indicating Transformer for Task-conditional Dense Predictions

Yuxiang Lu, Shalayiding Sirejiding, Bayram Bayramli, Suizhi Huang, Yue Ding, Hongtao Lu

TL;DR

The paper tackles the problem of learning both shared and task-specific representations for multi-task dense prediction. It proposes Task Indicating Transformer (TIT), a task-conditional framework built on a Vision Transformer backbone. TIT introduces a Mix Task Adapter, which uses a Task Indicating Matrix for parameter-efficient, task-conditioned feature adaptation, and a Task Gate Decoder, which uses a Task Indicating Vector and a gating mechanism for adaptive multi-scale feature refinement guided by task context. On NYUD-v2 and PASCAL-Context, TIT surpasses state-of-the-art task-conditional methods while limiting parameter overhead through low-rank factorization and shared gates. The results point to task-conditioned transformers as a parameter-efficient route to multi-task dense prediction.

Abstract

The task-conditional model is a distinctive stream for efficient multi-task learning. Existing works encounter a critical limitation in learning task-agnostic and task-specific representations, primarily due to shortcomings in global context modeling arising from CNN-based architectures, as well as a deficiency in multi-scale feature interaction within the decoder. In this paper, we introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge. Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition, thereby enhancing long-range dependency modeling and parameter-efficient feature adaptation by capturing intra- and inter-task features. Moreover, we propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement guided by task embeddings. Experiments on two public multi-task dense prediction benchmarks, NYUD-v2 and PASCAL-Context, demonstrate that our approach surpasses state-of-the-art task-conditional methods.
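To make the idea of a task-conditioned low-rank adapter more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the class name `LowRankTaskAdapter`, the placement of a small per-task matrix between shared down- and up-projections, the residual update, and the task names are all assumptions chosen for illustration. It only shows why sharing the projections and keeping one small matrix per task can cost fewer parameters than a full adapter per task.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankTaskAdapter:
    """Hypothetical adapter: shared down/up projections plus one small
    rank x rank matrix per task (a stand-in for the Task Indicating Matrix)."""

    def __init__(self, dim, rank, tasks, seed=0):
        rng = random.Random(seed)
        self.dim, self.rank = dim, rank
        self.down = random_matrix(rank, dim, rng)  # shared across tasks: dim -> rank
        self.up = random_matrix(dim, rank, rng)    # shared across tasks: rank -> dim
        # One small matrix per task; this is the only task-specific state here.
        self.task_matrix = {t: random_matrix(rank, rank, rng) for t in tasks}

    def forward(self, token, task):
        """Return token + up @ task_matrix[task] @ down @ token (residual update)."""
        low = matvec(self.down, token)
        low = matvec(self.task_matrix[task], low)
        delta = matvec(self.up, low)
        return [x + d for x, d in zip(token, delta)]

    def num_params(self):
        """Parameter count with shared projections and per-task rank x rank matrices."""
        shared = 2 * self.dim * self.rank
        per_task = self.rank * self.rank
        return shared + len(self.task_matrix) * per_task


def separate_adapter_params(dim, rank, num_tasks):
    """Baseline for comparison: one full down/up adapter pair per task."""
    return num_tasks * 2 * dim * rank


tasks = ["semseg", "depth", "normals"]  # illustrative task names (assumption)
adapter = LowRankTaskAdapter(dim=8, rank=2, tasks=tasks)
token = [1.0] * 8
out_semseg = adapter.forward(token, "semseg")
out_depth = adapter.forward(token, "depth")
print(adapter.num_params(), separate_adapter_params(8, 2, len(tasks)))  # prints: 44 96
```

In this toy setting the same input token yields a different output for each task, while the per-task cost is only `rank * rank` parameters on top of the shared projections. The actual module in the paper may differ in where nonlinearities, normalization, and the decomposition are applied.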

Paper Structure (10 sections, 7 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: (a) Architecture of the proposed Task Indicating Transformer (TIT). (b) Structure of the transformer block within the encoder's transformer layers. Mix Task Adapter modules are inserted after the Multi-head Self Attention (MSA) layer and the Multi-Layer Perceptron (MLP) layer.
  • Figure 2: Illustrations of the proposed Mix Task Adapter module and Task Gate Decoder module. Different colors of the Task Indicating Matrix and Task Indicating Vector correspond to distinct task types.
  • Figure 3: Qualitative results on the PASCAL-Context dataset.
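
The Figure 2 caption indicates that each task has its own Task Indicating Vector. As a rough illustration of how such a vector could drive a gating mechanism over multi-scale features, the sketch below computes one sigmoid gate per channel from a per-task vector and applies it at every scale. The function `task_gate`, the element-wise gate formula, the shared `gate_weights`, and the task names are assumptions made for this example, not the paper's decoder.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def task_gate(feature_maps, task_vector, gate_weights):
    """Scale each channel at every scale by a gate derived from the task vector.

    feature_maps: list of scales; each scale is a list of channels; each
                  channel is a flat list of floats (spatial positions).
    task_vector:  one value per channel (hypothetical Task Indicating Vector).
    gate_weights: one weight per channel, shared by all tasks (assumption).
    """
    gates = [sigmoid(w * v) for w, v in zip(gate_weights, task_vector)]
    refined = [
        [[g * value for value in channel] for g, channel in zip(gates, scale)]
        for scale in feature_maps
    ]
    return refined, gates


# Two scales with two channels each; the coarse scale has fewer positions.
features = [
    [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]],  # fine scale
    [[2.0], [8.0]],                                # coarse scale
]
task_vectors = {"semseg": [0.0, 0.0], "depth": [5.0, -5.0]}  # illustrative values
gate_weights = [1.0, 1.0]

refined_semseg, gates_semseg = task_gate(features, task_vectors["semseg"], gate_weights)
refined_depth, gates_depth = task_gate(features, task_vectors["depth"], gate_weights)
print(gates_semseg)  # prints: [0.5, 0.5]
```

With the zero vector every channel is halved at both scales, whereas the `depth` vector passes the first channel almost unchanged and nearly suppresses the second. The same features are thus refined differently per task while the gate weights stay shared.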