A Spitting Image: Modular Superpixel Tokenization in Vision Transformers
Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera
TL;DR
Vision Transformers traditionally tokenize images with fixed square patches, coupling token scale to architecture. This work introduces SPiT, a modular, online superpixel tokenizer that decouples tokenization from feature extraction. It combines graph-based hierarchical partitioning over $T$ levels with kernelized positional, color, and texture features, giving a composition $g = \gamma \circ \phi \circ \tau$ that subsumes the canonical ViT as a special case. SPiT yields irregular, semantically aligned tokens with pixel-level granularity, improves attribution faithfulness while maintaining competitive classification performance, and enables unsupervised segmentation without a decoder. Experiments on ImageNet1k and downstream datasets show more faithful attributions and robust segmentation, pointing toward a richer family of ViTs built on modular tokenization.
Abstract
Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction, a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions and gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.
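As a rough illustration of the decoupling described above, the sketch below composes three interchangeable stages, read here as a tokenizer $\tau$, a feature extractor $\phi$, and an embedder $\gamma$. This is a minimal sketch under stated assumptions, not the SPiT implementation: the assignment of roles to the symbols is an interpretation, every function name is hypothetical, the mean-color feature is a simplified stand-in for the kernelized features, and the brightness-threshold partition is a toy substitute for a superpixel algorithm. The point it shows is that a square-grid partition and an irregular partition can pass through the same $\phi$ and $\gamma$ unchanged.

```python
import numpy as np


def patch_tokenizer(image, patch=4):
    """tau (grid case): label each pixel with the id of its square patch."""
    h, w = image.shape[:2]
    rows = np.arange(h) // patch
    cols = np.arange(w) // patch
    n_cols = -(-w // patch)  # ceiling division: patches per row
    return rows[:, None] * n_cols + cols[None, :]


def threshold_tokenizer(image):
    """tau (toy irregular case): two regions split by mean brightness.

    Assumption for illustration only; this is NOT a superpixel method.
    """
    gray = image.mean(axis=-1)
    return (gray > gray.mean()).astype(np.int64)


def mean_color_features(image, regions):
    """phi: one feature vector per region (mean color as a stand-in)."""
    ids = regions.ravel()
    n_regions = ids.max() + 1
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    sums = np.zeros((n_regions, pixels.shape[1]))
    np.add.at(sums, ids, pixels)  # accumulate pixel values per region
    counts = np.bincount(ids, minlength=n_regions)[:, None]
    return sums / counts


def make_linear_embedder(weight):
    """gamma: project region features to the model dimension."""
    return lambda features: features @ weight


def tokenize(image, tau, phi, gamma):
    """g = gamma . phi . tau: partition, extract features, then embed."""
    regions = tau(image)
    return gamma(phi(image, regions))


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))          # small synthetic RGB image
gamma = make_linear_embedder(rng.standard_normal((3, 16)))

grid_tokens = tokenize(image, patch_tokenizer, mean_color_features, gamma)
irregular_tokens = tokenize(image, threshold_tokenizer, mean_color_features, gamma)
print(grid_tokens.shape, irregular_tokens.shape)
```

Only the partition function changes between the two calls. The number of tokens follows from the partition rather than from a fixed patch size, which is the sense in which the grid-based tokenizer is one instance of a more general pipeline.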
