Masked AutoDecoder is Effective Multi-Task Vision Generalist

Han Qiu, Jiaxing Huang, Peng Gao, Lewei Lu, Xiaoqin Zhang, Shijian Lu

TL;DR

MAD introduces a Masked AutoDecoder that unifies four vision tasks in a single sequence format using parallel decoding with bidirectional attention and masked sequence modeling. By tokenizing task outputs into a universal vocabulary, employing Hungarian matching for deterministic targets, and training with both fully and partially masked sequences, MAD learns rich task contexts and achieves fast, single-branch multi-task inference. Experiments on COCO demonstrate competitive accuracy across object detection, instance segmentation, keypoint detection, and image captioning, along with substantial inference speedups compared with autoregressive and task-specific baselines. This approach offers a scalable paradigm for vision generalists and lays groundwork for expanding task coverage with minimal task-specific engineering.
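Parallel decoding produces an unordered set of predictions, so each ground-truth object has to be assigned to one prediction slot before a per-token loss can be computed. The sketch below illustrates that assignment step only. It is not the authors' implementation: it swaps the Hungarian algorithm for a brute-force search that finds the same minimum-cost assignment on tiny inputs, and the L1 box cost and the names `l1_cost` and `match_targets` are assumptions made for illustration.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """Hypothetical matching cost: L1 distance between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_targets(pred_boxes, gt_boxes):
    """Return a list where entry i is the ground-truth index matched to
    prediction slot i, or None if slot i is left unmatched.

    Brute force over all one-to-one assignments; this stands in for the
    Hungarian algorithm and is only practical for a handful of boxes.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    assert n_gt <= n_pred, "need at least one prediction slot per object"

    best_cost, best_slots = float("inf"), ()
    # slots[g] is the prediction slot assigned to ground-truth object g.
    for slots in permutations(range(n_pred), n_gt):
        cost = sum(l1_cost(pred_boxes[s], gt_boxes[g]) for g, s in enumerate(slots))
        if cost < best_cost:
            best_cost, best_slots = cost, slots

    assignment = [None] * n_pred
    for g, s in enumerate(best_slots):
        assignment[s] = g
    return assignment


preds = [(0.9, 0.9, 1.0, 1.0), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.6, 0.6)]
gts = [(0.1, 0.1, 0.2, 0.2), (0.9, 0.9, 1.0, 1.0)]
print(match_targets(preds, gts))
```

Once every slot has a fixed target (or is marked unmatched), the training target for the sequence no longer depends on the order in which objects happen to be listed.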

Abstract

Inspired by the success of general-purpose models in NLP, recent studies attempt to unify different vision tasks in the same sequence format and employ autoregressive Transformers for sequence prediction. They apply uni-directional attention to capture sequential dependencies and generate task sequences recursively. However, such autoregressive Transformers may not fit vision tasks well, as vision task sequences usually lack the sequential dependencies typically observed in natural languages. In this work, we design Masked AutoDecoder (MAD), an effective multi-task vision generalist. MAD consists of two core designs. First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies comprehensively and decode vision task sequences in parallel. Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences. In this way, MAD handles all the tasks by a single network branch and a simple cross-entropy loss with minimal task-specific designs. Extensive experiments demonstrate the great potential of MAD as a new paradigm for unifying various vision tasks. MAD achieves superior performance and inference efficiency compared to autoregressive counterparts while obtaining competitive accuracy with task-specific models. Code will be released.
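The two designs named in the abstract can be illustrated with a small, framework-free sketch. It contrasts the attention pattern of an autoregressive decoder with the bidirectional pattern used for parallel decoding, and shows how a task sequence might be randomly masked so that the masked positions become reconstruction targets. The token names, the `mask_sequence` helper, the masking ratio, and the choice to keep prompt tokens visible are all assumptions for illustration, not details taken from the paper.

```python
import random

MASK = "<mask>"  # mask-token name borrowed from the figure notation


def causal_mask(n):
    """Unidirectional attention: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def mask_sequence(tokens, num_prompt, ratio, rng):
    """Replace a random fraction of the task tokens with MASK.

    Assumption: the first `num_prompt` tokens are task prompts and stay visible.
    Returns the masked sequence and the sorted positions to reconstruct.
    """
    task_positions = list(range(num_prompt, len(tokens)))
    num_masked = round(ratio * len(task_positions))
    masked_positions = set(rng.sample(task_positions, num_masked))
    masked = [MASK if i in masked_positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(masked_positions)


rng = random.Random(0)
sequence = ["<detect>", "y1", "x1", "y2", "x2", "cat"]  # hypothetical detection sequence
masked, targets = mask_sequence(sequence, num_prompt=1, ratio=0.6, rng=rng)
print(masked, targets)
```

With `ratio=1.0` every task token is masked, which corresponds to the fully masked input that a parallel decoder would receive at inference time. A cross-entropy loss would then be computed between the decoder output and the original tokens at the returned positions.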

Paper Structure (17 sections, 1 equation, 8 figures, 9 tables)

Figures (8)

  • Figure 1: The proposed MAD significantly outperforms the state-of-the-art Pix2SeqV2 in inference time while achieving competitive accuracy across four representative vision tasks. The Average Performance is averaged over four tasks: object detection (mAP), instance segmentation (mAP), keypoint detection (mAP), and image captioning (B@4). MAD achieves approximately 100$\times$ acceleration in inference time.
  • Figure 2: Unified Sequence-to-sequence Modeling of Vision Tasks. The traditional Autoregressive Decoder adopts sequential decoding for prediction and uses unidirectional attention, where each token can attend only to the tokens before it. It generates task sequences token by token, so generation is slow and takes up to $N$ steps, where $N$ is the sequence length. The proposed Masked AutoDecoder (MAD), equipped with parallel decoding with bidirectional attention and masked sequence modeling, can decode task sequences in a single step. Additionally, by masking and reconstructing task sequences, MAD can capture rich task contexts for different tasks, resulting in an effective and efficient vision generalist. Tokens in blue denote task prompts. $<mask>$ denotes the mask token. Task Sequences are simplified here; details are described in the ensuing sections.
  • Figure 3: Illustration of our proposed MAD with masked training and masked inference. MAD consists of two major parts: a Backbone + Encoder that extracts the representation of the input images, and a Decoder that processes Masked Sequences for prediction. During training, MAD randomly masks Task Sequences (blue for prompt tokens and yellow for task tokens) to generate Masked Sequences (white for masked tokens) and learns to reconstruct the Task Sequences. During inference, MAD takes fully Masked Sequences as input and repeats the decoding and masking process to refine predictions. When K in masked inference is set to 0, MAD skips refinement and generates predictions directly in one step.
  • Figure 4: Convergence curves for Autoregressive Decoding, Parallel Decoding, and the proposed MAD, as compared in the main components table (tab:main_components). MAD converges much faster on vision-centric tasks and, for image captioning, greatly narrows the gap to Autoregressive Decoding relative to Parallel Decoding.
  • Figure 5: Instance segmentation Weights.
  • ...and 3 more figures
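
The masked inference procedure described in the Figure 3 caption can be sketched as a short loop: start from a fully masked sequence, decode every position in one parallel pass, then repeat a mask-and-decode step K times. The caption does not say which tokens are re-masked, so the sketch assumes the least confident ones are; that criterion, the `remask_ratio` parameter, and the `decode` callable (with its stub `toy_decode`) are hypothetical stand-ins, not the authors' implementation.

```python
MASK = "<mask>"  # mask-token name borrowed from the figure notation


def masked_inference(decode, prompt, length, num_refine=0, remask_ratio=0.5):
    """Decode `length` task tokens in parallel, then refine `num_refine` (K) times.

    `decode(seq)` is a hypothetical model call returning one (token, confidence)
    pair per position of `seq`, all produced in a single parallel pass.
    """
    num_prompt = len(prompt)
    seq = list(prompt) + [MASK] * length
    preds = decode(seq)  # with num_refine == 0 this single pass is the answer

    for _ in range(num_refine):
        # Assumption: re-mask the least confident task tokens and predict them again.
        num_remask = int(remask_ratio * length)
        by_confidence = sorted(range(num_prompt, len(seq)), key=lambda i: preds[i][1])
        remask = set(by_confidence[:num_remask])
        seq = list(prompt) + [
            MASK if i in remask else preds[i][0] for i in range(num_prompt, len(seq))
        ]
        refreshed = decode(seq)
        preds = [refreshed[i] if i in remask else preds[i] for i in range(len(seq))]

    return [token for token, _ in preds[num_prompt:]]


calls = []  # records every decoder input so the number of passes can be inspected


def toy_decode(seq):
    """Stub decoder: token 't<i>' at position i, confidence shrinking with i."""
    calls.append(list(seq))
    return [(f"t{i}", 1.0 / (i + 1)) for i in range(len(seq))]


print(masked_inference(toy_decode, ["<detect>"], length=4, num_refine=2))
print(len(calls), "decoder passes")
```

The number of decoder passes is K + 1 regardless of sequence length, whereas a token-by-token autoregressive decoder needs one pass per generated token.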