SPFormer: Enhancing Vision Transformer with Superpixel Representation

Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie

TL;DR

SPFormer tackles the gap between pixel-level detail and patch-based global reasoning by introducing an adaptive superpixel representation for Vision Transformers. It pairs a trainable superpixel module with Superpixel Cross Attention to iteratively fuse pixel and superpixel information, enabling efficient global interactions over a compact token set. The approach yields improvements on ImageNet over DeiT baselines, enhances segmentation performance through high-resolution feature preservation, and provides explainability via visualizable pixel–superpixel associations that align with semantic boundaries. The results demonstrate robust performance under rotations and occlusions, highlighting the practical potential of region-aware representations in scalable vision models.

Abstract

In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions that capture intricate details, and it is applicable at both initial and intermediate feature levels. SPFormer, trainable end-to-end, exhibits superior performance across various benchmarks. Notably, it shows significant improvements on the challenging ImageNet benchmark, achieving a 1.4% increase over DeiT-T and a 1.1% increase over DeiT-S. A standout feature of SPFormer is its inherent explainability. The superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance the model's interpretability. This level of clarity significantly improves SPFormer's robustness, particularly in challenging scenarios such as image rotations and occlusions, demonstrating its adaptability and resilience.

Paper Structure (25 sections, 5 equations, 6 figures, 6 tables)

This paper contains 25 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visualization of learned superpixels with our SPFormer trained on ImageNet with category labels only. For each row, we show input image, visualization of 196, 49, and 16 superpixels. The learned superpixel aligns well with the object boundaries even with 16 superpixels. The last row shows results from a COCO image (not trained), demonstrating SPFormer's zero-shot ability.
  • Figure 2: Illustration of our SCA module for iterative refinement of both superpixel and pixel features using a sliding window-based cross-attention mechanism. Each superpixel cross-attends to a localized region of pixels, as highlighted in the colored rectangle. On the left, we detail the Pixel-to-Superpixel (P2S) cross-attention process, while the Superpixel-to-Pixel (S2P) cross-attention is depicted similarly, albeit with reversed roles for superpixel and pixel.
  • Figure 3: Illustration of a single stage of the SPFormer architecture. It starts with initial superpixel features and pixel features as inputs. The SCA module iteratively refines superpixel features, enhancing their semantic richness. These features are then processed by the Multi-Head Self-Attention (MHSA) for global contextual understanding. The stage concludes by updating the pixel features based on the enriched superpixel information, readying them for the next stage or for final pooling and classification. This design showcases the efficient integration of local detail and global context in SPFormer.
  • Figure 4: The multi-head SCA design generates multiple superpixel representations, each capturing different semantic relationships and addressing the ambiguity in superpixel over-segmentation.
  • Figure 5: Zero-shot transferability on the COCO dataset. Trained solely on ImageNet, SPFormer demonstrates effective segmentation of unseen COCO images into detailed superpixels. 196 superpixels are used in this visualization.
  • ...and 1 more figure
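
The P2S and S2P cross-attention described in the Figure 2 and Figure 3 captions can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: it omits learned query/key/value projections, multiple heads, and the MHSA step; it emulates the sliding window with a boolean mask over a regular superpixel grid; and the names `local_window_mask`, `masked_softmax`, `sca_step`, the `radius` parameter, and the residual updates are all hypothetical choices made for this example.

```python
import numpy as np

def local_window_mask(h, w, stride, radius):
    """mask[s, p] is True when pixel p lies within `radius` (Chebyshev
    distance) of the grid center of superpixel s. Assumed layout: one
    superpixel per stride x stride cell."""
    hs, ws = h // stride, w // stride
    cy, cx = np.meshgrid((np.arange(hs) + 0.5) * stride,
                         (np.arange(ws) + 0.5) * stride, indexing="ij")
    py, px = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    dy = np.abs(cy.reshape(-1, 1) - py.reshape(1, -1))
    dx = np.abs(cx.reshape(-1, 1) - px.reshape(1, -1))
    return np.maximum(dy, dx) <= radius  # shape (S, P)

def masked_softmax(logits, mask, axis):
    """Softmax along `axis` that gives zero weight where mask is False."""
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=axis, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=axis, keepdims=True)

def sca_step(pix, sp, mask):
    """One simplified refinement step. pix: (P, C), sp: (S, C), mask: (S, P)."""
    scale = pix.shape[1] ** -0.5
    # P2S: each superpixel (query) attends to the pixels in its local window.
    p2s = masked_softmax((sp @ pix.T) * scale, mask, axis=1)
    sp = sp + p2s @ pix
    # S2P: each pixel (query) attends to the superpixels whose window covers it.
    s2p = masked_softmax((sp @ pix.T) * scale, mask, axis=0)
    pix = pix + s2p.T @ sp
    return pix, sp, s2p

rng = np.random.default_rng(0)
h = w = 16
stride, channels = 4, 8
pix = rng.standard_normal((h * w, channels))

# Each superpixel starts as the mean of the pixels in its own grid cell.
cell = local_window_mask(h, w, stride, radius=stride / 2)
sp = (cell.astype(float) @ pix) / cell.sum(axis=1, keepdims=True)

# The attention window spans the cell plus its neighbours (3 x 3 cells).
window = local_window_mask(h, w, stride, radius=1.5 * stride)
for _ in range(2):
    pix, sp, assoc = sca_step(pix, sp, window)

# Hard pixel-to-superpixel map, the kind of association Figure 1 visualizes.
assign = assoc.argmax(axis=0).reshape(h, w)
```

Here `assoc[s, p]` is the soft association between pixel `p` and superpixel `s`; taking the argmax over superpixels yields a label map that could be drawn as superpixel boundaries. The mask keeps the cost of each step tied to the window size rather than to all pixel-superpixel pairs, although this dense-mask version still materializes the full `(S, P)` matrix for clarity.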