A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera

TL;DR

Vision Transformers traditionally tokenize images with fixed square patches, coupling token scale to architecture. This work introduces SPiT, a modular, online superpixel tokenizer that decouples tokenization from feature extraction through a graph-based, hierarchical partitioning with $T$ levels and kernelized positional/color/texture features, forming a g = γ ∘ φ ∘ τ framework that subsumes canonical ViT as a special case. SPiT yields irregular, semantically aligned tokens with pixel-level granularity and improves attribution faithfulness while maintaining competitive classification and enabling unsupervised segmentation without decoders. Empirical results on ImageNet1k and downstream datasets demonstrate stronger interpretable attributions and robust segmentation, illustrating a scalable path to richer ViT families and broader applicability of modular tokenization in vision transformers.
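
The composition g = γ ∘ φ ∘ τ can be illustrated with a minimal sketch for the canonical square-patch special case. Everything here is an assumption made for illustration: the function names (`tokenize_patches`, `extract_features`, `embed`, `g`) are hypothetical, τ is represented as a per-pixel label map, φ simply gathers raw pixel values without the interpolation or kernelized features used by SPiT, and γ is a plain linear projection.

```python
from collections import defaultdict

def tokenize_patches(height, width, rho):
    """tau: label each pixel with the row-major index of its rho x rho patch."""
    patches_per_row = width // rho
    return [[(r // rho) * patches_per_row + (c // rho) for c in range(width)]
            for r in range(height)]

def extract_features(image, labels):
    """phi (simplified assumption): gather each region's pixel values in raster order.
    Without interpolation this only yields fixed-length vectors for equally sized regions."""
    regions = defaultdict(list)
    for image_row, label_row in zip(image, labels):
        for pixel, label in zip(image_row, label_row):
            regions[label].extend(pixel)
    return [regions[k] for k in sorted(regions)]

def embed(features, weight):
    """gamma: one shared linear projection applied to every region's feature vector."""
    return [[sum(x * w for x, w in zip(f, col)) for col in zip(*weight)]
            for f in features]

def g(image, rho, weight):
    """Compose the three modules: g = gamma(phi(tau(image)))."""
    labels = tokenize_patches(len(image), len(image[0]), rho)
    return embed(extract_features(image, labels), weight)

# 4x4 single-channel image with pixel value 4*r + c, patch size 2.
image = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
# Projection of shape (rho * rho * channels, 1) that sums the pixels of each patch.
weight = [[1.0]] * 4
tokens = g(image, 2, weight)
print(tokens)  # [[10.0], [18.0], [42.0], [50.0]]
```

Because the tokenizer only produces a label map, replacing `tokenize_patches` with any other partition leaves `embed` untouched, which is the sense in which the modules are decoupled in this sketch.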

Abstract

Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.
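
The randomized-partition baseline can be pictured as a Voronoi tessellation of the pixel grid. The sketch below is illustrative only: the function name `voronoi_partition`, the uniform sampling of seed points, and the squared Euclidean distance are assumptions, since this summary does not specify how the baseline partitions are drawn.

```python
import random

def voronoi_partition(height, width, n_seeds, rng):
    """Label each pixel with the index of its nearest random seed point.

    Assumption: seeds are sampled uniformly over the image plane and
    nearness is measured with squared Euclidean distance.
    """
    seeds = [(rng.uniform(0, height), rng.uniform(0, width))
             for _ in range(n_seeds)]

    def nearest(r, c):
        # Index of the seed closest to pixel (r, c).
        return min(range(n_seeds),
                   key=lambda k: (r - seeds[k][0]) ** 2 + (c - seeds[k][1]) ** 2)

    return [[nearest(r, c) for c in range(width)] for r in range(height)]

# A 16x16 label map with up to 8 irregular, content-independent regions.
labels = voronoi_partition(16, 16, 8, random.Random(0))
```

The resulting label map has the same form as a patch or superpixel partition, so it can be passed to the same feature extraction and embedding steps. The regions are irregular like superpixels but ignore image content, which is what makes such a partition useful as a baseline.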

Paper Structure (41 sections, 3 theorems, 16 equations, 14 figures, 10 tables)


Key Result

Proposition (Embedding Equivalence)

Let $\tau^*$ denote a canonical ViT tokenizer with a fixed patch size $\rho$, let $\phi$ denote a gradient-excluding interpolated feature extractor, and let $\gamma^*, \gamma$ denote embedding layers with equivalent linear projections $L^*_\theta = L_\theta$. Let $\hat{\xi}^{(\ldots)}$ ...

Figures (14)

  • Figure 1: Tokenized image and attributions for the prediction "grass snake" with different tokenizers: square patches (ViT), Voronoi tessellation (RViT), and superpixels (SPiT). We show more results in Appendix \ref{sec:attmaps}.
  • Figure 2: Illustration of modular tokenization in ViT architecture.
  • Figure 3: Visualization of superpixel aggregation.
  • Figure 4: Non-cherry-picked samples ({0257..0264}.jpg) of unsupervised zero-shot segmentation results on ECSSD.
  • Figure 5: Feature correspondences from a source image (left) to target images (right), mapped via normalized single-head cross-attention and colored using low-rank PCA. We show more results in Appendix \ref{sec:ftcorr}.
  • ...and 9 more figures

Theorems & Definitions (9)

  • Proposition: Embedding Equivalence
  • Definition: ViT Tokenization
  • Definition: ViT Features
  • Definition: ViT Embedder
  • Lemma: Feature Equivalence
  • Proof
  • Proposition: Embedding Equivalence
  • Proof
  • Remark