Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon

TL;DR

This paper addresses the loss of semantic integrity in Vision Transformer tokenization by replacing fixed grid patches with superpixel-based tokens. It introduces SuiT, a two-stage tokenization pipeline that first builds pixel-level embeddings and then aggregates them via superpixel-aware pooling, producing one $D$-dimensional token per superpixel regardless of the superpixel's shape or location. Across ImageNet-1K classification, transfer learning, and zero-shot segmentation, SuiT consistently outperforms strong baselines, supports adaptive inference by varying the number of tokens, and preserves semantic coherence in its token representations. The method is plug-and-play with vanilla ViT backbones and offers improved robustness and interpretability.

Abstract

Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
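
The superpixel-aware aggregation step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that per-pixel features and an integer superpixel label map are already available, and that the average-pooled and max-pooled features of each superpixel are concatenated into one token. The function name `superpixel_pool`, the array shapes, and the concatenation are assumptions made for illustration only.

```python
import numpy as np


def superpixel_pool(features, labels, num_superpixels):
    """Pool per-pixel features into one token per superpixel (illustrative sketch).

    features: (H, W, C) array of per-pixel features (assumed given).
    labels:   (H, W) integer array, each value in [0, num_superpixels).
    Returns:  (num_superpixels, 2 * C) array, average and max pooling concatenated.
    """
    features = np.asarray(features, dtype=np.float64)
    c = features.shape[-1]
    flat_feats = features.reshape(-1, c)
    flat_labels = np.asarray(labels).reshape(-1)

    # Average pooling: accumulate feature sums per superpixel, divide by pixel counts.
    sums = np.zeros((num_superpixels, c))
    np.add.at(sums, flat_labels, flat_feats)
    counts = np.bincount(flat_labels, minlength=num_superpixels)
    avg = sums / np.maximum(counts, 1)[:, None]

    # Max pooling: running element-wise maximum per superpixel.
    mx = np.full((num_superpixels, c), -np.inf)
    np.maximum.at(mx, flat_labels, flat_feats)
    mx[counts == 0] = 0.0  # empty superpixels get a zero token instead of -inf

    # Assumption: the two pooled vectors are concatenated into a single token.
    return np.concatenate([avg, mx], axis=1)


# Toy usage: a 2x2 image with one feature channel and two superpixels.
toy_features = np.array([[[1.0], [2.0]], [[3.0], [5.0]]])
toy_labels = np.array([[0, 0], [1, 1]])
toy_tokens = superpixel_pool(toy_features, toy_labels, num_superpixels=2)
print(toy_tokens.shape)
```

Because pooling is indexed by the label map rather than by a fixed grid, the number of output tokens depends only on the number of superpixels, which is what lets a pipeline of this kind handle irregular shapes and vary the token count at inference time.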

Paper Structure

This paper contains 39 sections, 6 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: A high-level overview of tokenization. (Top) The conventional grid-like tokenization in ViT and (Bottom) our proposed superpixel tokenization.
  • Figure 1: Comparison of classification performance on ImageNet-1K. Following prior studies, we conducted experiments in two different settings for fair comparison. In the "Weight-Init." column, "Random" refers to from-scratch training on ImageNet-1K, while "IN-21K" refers to training from ViT weights pre-trained on ImageNet-21K. $^\dagger$ denotes that the method was initialized with IN-21K pre-trained + IN-1K fine-tuned ViT weights. Cells marked '-' mean that the value is not available.
  • Figure 2: Overview of our superpixel tokenization pipeline. Local features are extracted and combined with positional encodings, followed by superpixel-aware aggregation using average and max pooling to produce superpixel tokens, which are fed into the Vision Transformer.
  • Figure 3: Adaptive inference. At the same image resolution (224), DeiT is forced to use a fixed number of tokens (196), whereas both SuiT and SPiT can adaptively adjust the number of tokens during inference. Notably, across all token counts and model scales, SuiT consistently outperforms the baselines at the same computational cost.
  • Figure 4: Qualitative results of zero-shot salient object segmentation. $^{\dagger}$ denotes the model with additional post-processing. DINO-SuiT successfully detects salient objects in both single- and multi-object scenarios without any post-processing.
  • ...and 7 more figures