SPoT: Subpixel Placement of Tokens in Vision Transformers

Martine Hjelkrem-Tan, Marius Aasan, Gabriel Y. Arteaga, Adín Ramírez Rivera

TL;DR

This paper proposes Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, sidestepping grid-based limitations and redefining sparsity as a strategic advantage rather than an imposed limitation.

Abstract

Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
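To make continuous token placement concrete, the following is a minimal sketch of extracting one fixed-size patch centered at a non-integer image coordinate. It is not the authors' implementation: the function name `extract_subpixel_patch`, the use of bilinear interpolation, the restriction to single-channel images, and the clamping at image borders are all assumptions chosen for illustration.

```python
import numpy as np


def extract_subpixel_patch(image, center_xy, patch_size):
    """Sample a patch_size x patch_size window centered at a continuous (x, y).

    Hypothetical helper: bilinear interpolation and border clamping are
    illustrative assumptions, not details taken from the paper.
    `image` is a 2D (grayscale) array indexed as image[y, x].
    """
    h, w = image.shape
    cx, cy = center_xy
    img = image.astype(np.float64)

    # Continuous sample coordinates, symmetric around the requested center.
    offsets = np.arange(patch_size) - (patch_size - 1) / 2.0
    xs = np.clip(cx + offsets, 0, w - 1)
    ys = np.clip(cy + offsets, 0, h - 1)

    # Integer neighbors and fractional weights for bilinear interpolation.
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (xs - x0)[None, :]  # horizontal weights, one per column
    wy = (ys - y0)[:, None]  # vertical weights, one per row

    # Blend the four neighboring pixels of every sample location.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bottom = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bottom * wy


# Usage: three tokens placed at arbitrary subpixel centers in a toy image.
toy_image = np.arange(64, dtype=np.float64).reshape(8, 8)
centers = [(3.5, 2.25), (1.2, 5.7), (6.0, 6.0)]
tokens = np.stack(
    [extract_subpixel_patch(toy_image, c, 2).ravel() for c in centers]
)
```

Each row of `tokens` is a flattened patch that could be fed to a patch embedding layer. Because the centers are real-valued, they are not tied to any grid cell, which is the property the abstract refers to as continuous positioning.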

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: (Left) A standard ViT splits the image into a fixed grid of non‐overlapping patches. (Right) With SPoT, a continuously sampled set of subpixel-precise patches is extracted.
  • Figure 2: Grids cannot align over all key features. (In-grid) A $5 \times 5$ patch grid (gray) with three optimal region placements for sparse feature selection. The green patch is well aligned (A), yellow straddles two cells (B), and red lies on a corner (C) and leaks into four cells. Translating the grid only swaps which peak is misaligned; one patch is always bad. (Off-grid) Our subpixel tokenizer drops fixed-size windows (green squares) directly on each peak, eliminating the alignment trade-off while still allowing conventional grid tokens when they are well aligned.
  • Figure 3: Different sampling priors that can be employed with SPoT. The Sobol prior (not shown) produces uniform quasirandom placements with explicit constraints on coverage.
  • Figure 4: Illustration of oracle placements with 25 tokens with SPoT-ON. By optimizing our oracle-neighborhood search equation (Eq. \ref{eq:oracle_optimization}) all the way through the model, the oracle discovers optimal placement of points, yielding an accuracy of $90.9\%$ on ImageNet1k with only $\sim 12.5\%$ of the tokens. Trajectories are colored starting with dark purple for initial points, with endpoints colored bright yellow.
  • Figure 5: We show ImageNet1k accuracy vs. throughput with five models at four sparsity levels. The ceiling denotes performance unlikely to be achieved given the intrinsic label noise in ImageNet (Beyer et al., 2020). The gap highlights the margin between SPoT with optimal configuration and SPoT-ON, illustrating the possible performance gain through better token placement.
  • ...and 3 more figures