All in Tokens: Unifying Output Space of Visual Tasks via Soft Token

Jia Ning, Chen Li, Zheng Zhang, Zigang Geng, Qi Dai, Kun He, Han Hu

TL;DR

This work tackles the challenge of unifying the output spaces across diverse visual tasks by introducing AiT, a general-purpose solver built on a lightweight VQ-VAE tokenizer-detokenizer and an auto-regressive task-solver. It introduces soft tokens to create a continuous, learnable embedding space for token predictions and mask augmentation to handle undefined/occluded regions in annotations, enabling simultaneous handling of depth estimation and instance segmentation. The approach achieves state-of-the-art depth accuracy on NYUv2 and competitive results on COCO, with a parallel decoding variant offering further gains, and demonstrates the practicality of a single model for multiple tasks. These findings suggest a viable path toward general-purpose visual task solvers and potential extension to a broader class of vision problems.
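To make the soft-token idea concrete, here is a minimal sketch in plain Python. It assumes the task-solver produces one logit per codebook entry for each output position. A hard token keeps only the embedding of the highest-scoring entry, while a soft token is the probability-weighted average of all codebook embeddings. The function names, the three-entry codebook, and the logits are illustrative assumptions, not taken from the AiT code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_token_embedding(logits, codebook):
    """One-hot assignment: return the embedding of the arg-max codebook entry."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return list(codebook[best])

def soft_token_embedding(logits, codebook):
    """Soft assignment: probability-weighted average of all codebook embeddings."""
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * entry[d] for p, entry in zip(probs, codebook))
            for d in range(dim)]

# Toy codebook with three 2-D embeddings (sizes are illustrative only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Entries 1 and 2 score almost equally; a hard token discards that information.
logits = [0.1, 2.0, 1.9]

hard = hard_token_embedding(logits, codebook)
soft = soft_token_embedding(logits, codebook)
```

In this toy case the hard token collapses onto entry 1, whereas the soft token lies between entries 1 and 2, so the near-tie in the prediction is preserved in the embedding passed on to the next step or to the detokenizer.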

Abstract

Unlike language tasks, where the output space is usually limited to a set of tokens, the output space of visual tasks is more complicated, making it difficult to build a unified model for various visual tasks. In this paper, we seek to unify the output space of visual tasks so that a unified model can be built for them as well. To this end, we demonstrate a single unified model that simultaneously handles two typical visual tasks, instance segmentation and depth estimation, which have discrete/fixed-length and continuous/varied-length outputs, respectively. We propose several new techniques that take into account the particularities of visual tasks: 1) Soft tokens. We employ soft tokens to represent the task output. Unlike hard tokens in the common VQ-VAE, which are assigned one-hot to discrete codebooks/vocabularies, a soft token is assigned softly to the codebook embeddings. Soft tokens can improve the accuracy of both next-token inference and decoding of the task output; 2) Mask augmentation. Many visual tasks have corrupted, undefined, or invalid values in their label annotations, e.g., occluded areas of depth maps. We show that a mask augmentation technique can greatly benefit these tasks. With these new techniques and other designs, we show that the proposed general-purpose task-solver performs both instance segmentation and depth estimation well. In particular, we achieve 0.279 RMSE on the specific task of NYUv2 depth estimation, setting a new record on this benchmark. The general-purpose task-solver, dubbed AiT, is available at https://github.com/SwinTransformer/AiT.
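The abstract does not spell out how mask augmentation is applied, so the following is only a sketch under stated assumptions: invalid depth values are stored as 0, random square patches of the tokenizer input are blanked out, and the reconstruction loss is computed against the original map on valid pixels only. The helper names, the patch size, and the masking ratio are all hypothetical.

```python
import random

INVALID = 0.0  # assumption: invalid or missing depth is stored as 0

def mask_augment(depth, patch, ratio, rng):
    """Return a copy of `depth` where each patch x patch block is
    set to INVALID with probability `ratio`."""
    height, width = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < ratio:
                for y in range(top, min(top + patch, height)):
                    for x in range(left, min(left + patch, width)):
                        out[y][x] = INVALID
    return out

def masked_l1_loss(pred, target):
    """Mean absolute error over pixels whose target value is valid."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t != INVALID:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0

rng = random.Random(0)
# Toy 4x4 ground-truth depth map with one invalid pixel.
gt = [[1.0, 2.0, 0.0, 4.0],
      [1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0],
      [5.0, 6.0, 7.0, 8.0]]

# The tokenizer would see `augmented` as input, while the loss still
# compares its reconstruction with `gt` on valid pixels.
augmented = mask_augment(gt, patch=2, ratio=0.5, rng=rng)
```

Under these assumptions the tokenizer has to fill in blanked regions from the surrounding context during training, which is one plausible way to obtain sensible reconstructions where the annotation itself is missing.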

Paper Structure (29 sections, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Illustration of our unified framework with two stages. In this framework, various vision task outputs are transferred to discrete token space by a VQ-VAE tokenizer. In this way, discrete or continuous visual tasks can be converted into one discrete classified task,
  • Figure 2: Illustration of the token formats for instance segmentation and depth estimation. (a) Every object is represented by 21 tokens. For positive objects, the format consists of 4 box tokens, 1 label token, and 16 mask tokens; for noise, 4 noise tokens, 1 background token, and 16 zero tokens are used instead. (b) For dense tasks such as depth estimation, every patch is treated as a token, forming a token map.
  • Figure 3: The GT depth map contains some corrupted regions (black regions/pixels). Although these regions are also ignored when training the VQ-VAE, the reconstructions there are still abnormal, which shows up as shadows in the reconstruction results. This phenomenon can be alleviated by adding mask augmentation.
  • Figure 4:
  • Figure 5: Visualization on instance segmentation task of our method.
  • ...and 1 more figures
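
The 21-token object format described in the Figure 2 caption can be sketched as follows. The split into 4 box, 1 label, and 16 mask tokens, and the noise layout of 4 noise, 1 background, and 16 zero tokens, come from the caption. The concrete token ids, the coordinate vocabulary size, the fixed number of object slots, and the function names are hypothetical choices made only for illustration.

```python
import random

BOX_TOKENS, LABEL_TOKENS, MASK_TOKENS = 4, 1, 16
TOKENS_PER_OBJECT = BOX_TOKENS + LABEL_TOKENS + MASK_TOKENS  # 21

BACKGROUND_ID = 999  # hypothetical id of the background class token
ZERO_MASK_ID = 0     # hypothetical id used for the 16 "zero" mask tokens
BOX_VOCAB = 100      # hypothetical number of coordinate bins

def encode_object(box, label, mask):
    """Concatenate 4 box tokens, 1 label token, and 16 mask tokens."""
    if len(box) != BOX_TOKENS or len(mask) != MASK_TOKENS:
        raise ValueError("expected 4 box tokens and 16 mask tokens")
    return list(box) + [label] + list(mask)

def encode_noise(rng):
    """Noise entry: 4 random box tokens, 1 background token, 16 zero tokens."""
    noise_box = [rng.randrange(BOX_VOCAB) for _ in range(BOX_TOKENS)]
    return noise_box + [BACKGROUND_ID] + [ZERO_MASK_ID] * MASK_TOKENS

def build_sequence(objects, num_slots, rng):
    """Encode real objects, then pad with noise entries up to `num_slots`."""
    if len(objects) > num_slots:
        raise ValueError("more objects than slots")
    sequence = []
    for box, label, mask in objects:
        sequence.extend(encode_object(box, label, mask))
    for _ in range(num_slots - len(objects)):
        sequence.extend(encode_noise(rng))
    return sequence

def split_objects(sequence):
    """Cut a flat sequence back into (box, label, mask) triples."""
    chunks = [sequence[i:i + TOKENS_PER_OBJECT]
              for i in range(0, len(sequence), TOKENS_PER_OBJECT)]
    return [(c[:BOX_TOKENS], c[BOX_TOKENS], c[BOX_TOKENS + LABEL_TOKENS:])
            for c in chunks]

rng = random.Random(0)
# One real object with made-up token ids, padded to three slots.
objects = [([10, 20, 60, 80], 3, list(range(100, 116)))]
sequence = build_sequence(objects, num_slots=3, rng=rng)
parsed = split_objects(sequence)
```

Because every entry occupies exactly 21 positions, a flat sequence can be cut back into objects by position alone, and entries whose label is the background token can be discarded when reading out predictions.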