Your ViT is Secretly an Image Segmentation Model

Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus

TL;DR

This paper demonstrates that the need for task-specific components in ViT-based image segmentation diminishes as model size and pretraining scale up. It introduces the Encoder-only Mask Transformer (EoMT), which repurposes a plain ViT with a small mask module and a mask-annealing training strategy to perform segmentation without a decoder or masked attention at inference. Empirically, EoMT achieves competitive performance across panoptic, semantic, and instance segmentation while delivering substantial speedups, with performance improving as ViT size and pretraining scale (e.g., DINOv2/EVA-02 pretraining) are increased. The work argues for allocating compute toward scaling ViTs and foundation-model pretraining rather than adding architectural complexity, establishing a simple, scalable baseline for future segmentation research.

Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: ViT-Adapter + Mask2Former vs. EoMT (Ours). EoMT demonstrates an optimal balance between Panoptic Quality (PQ) and FPS across different sizes of DINOv2 [oquab2023dinov2] pre-trained ViTs [dosovitskiy2021vit]. Evaluation on COCO val2017 [lin2014coco]; see the model-size table (tab:model_size).
  • Figure 2: EoMT architecture. Learnable queries are concatenated to the patch tokens after the first $L_1$ ViT encoder blocks. These concatenated tokens are then jointly processed by the last $L_2$ blocks and used to predict class and mask logits.
  • Figure 3: Masked self-attention during training. In the final $L_2$ blocks of EoMT, patch tokens and queries are jointly processed by self-attention. During training, the intermediate mask predictions are used to mask the query-to-patch portion of the attention operation, mimicking the masked cross-attention of Mask2Former (M2F) [cheng2022mask2former].
  • Figure 4: Mask annealing during training. Self-attention is initially masked in the final $L_2$ ($=4$ for ViT-L) EoMT blocks. The masking probability is gradually annealed, starting from early blocks, until it is no longer needed at the end of training.
  • Figure A: Removing task-specific components. We visualize the architectures of the resulting intermediate configurations.
  • ...and 2 more figures
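
The training-time masking described in the captions of Figures 3 and 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The captions only say that the query-to-patch part of self-attention is masked with intermediate mask predictions, and that the masking probability is annealed block by block, starting from early blocks. Everything else here is a hypothetical choice made for the example: the token order (queries first, then patches), the 0.5 threshold, the fallback to full attention when a predicted mask is empty, and the staggered linear schedule in `mask_probability`.

```python
import random


def build_attention_mask(mask_probs, p_mask, rng, threshold=0.5):
    """Return allowed[i][j]; True means token i may attend to token j.

    Assumed token order: [queries..., patches...].
    mask_probs[q][p] is the intermediate mask probability that
    query q assigns to patch p.
    """
    num_q = len(mask_probs)
    num_p = len(mask_probs[0])
    total = num_q + num_p
    # Start fully unmasked: patch-to-patch, patch-to-query and
    # query-to-query attention are never restricted.
    allowed = [[True] * total for _ in range(total)]
    for q in range(num_q):
        # Annealing: mask this query only with probability p_mask.
        if rng.random() >= p_mask:
            continue
        keep = [prob > threshold for prob in mask_probs[q]]
        if not any(keep):
            # Assumption: an empty predicted mask falls back to
            # full attention so the query still sees some patches.
            continue
        for p in range(num_p):
            allowed[q][num_q + p] = keep[p]
    return allowed


def mask_probability(step, total_steps, block_idx, num_blocks):
    """Hypothetical staggered linear schedule (not from the paper).

    Training is split into num_blocks equal windows. Block k keeps
    probability 1.0 until its window starts, then decays linearly
    to 0.0, so earlier blocks are unmasked first and every block is
    unmasked once training ends.
    """
    window = total_steps / num_blocks
    progress = (step - block_idx * window) / window
    return min(1.0, max(0.0, 1.0 - progress))


if __name__ == "__main__":
    rng = random.Random(0)
    # Two queries, three patches (toy values).
    probs = [[0.9, 0.1, 0.8], [0.2, 0.3, 0.1]]
    p = mask_probability(step=0, total_steps=100, block_idx=0, num_blocks=4)
    for row in build_attention_mask(probs, p, rng):
        print(row)
```

Once `mask_probability` reaches 0.0 for every block, `build_attention_mask` returns an all-`True` matrix, which corresponds to the unmasked self-attention the model uses at inference.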