CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones

Wenyi Gong, Mieszko Lis

TL;DR

This paper introduces a simple yet effective token merging method that maintains spatial integrity, making it compatible with spatial architectures. The method performs strongly both off-the-shelf and with fine-tuning.

Abstract

Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIoU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
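
The abstract names a max-magnitude-per-dimension token representation but does not spell out the rule. The sketch below shows one plausible reading: for each feature dimension, keep the signed value whose absolute value is largest among the tokens being merged. The function name `max_magnitude_merge` and the tie-breaking behavior (first token wins) are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def max_magnitude_merge(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Merge tokens into one vector (hypothetical helper, assumed rule).

    For every feature dimension, keep the signed value with the largest
    absolute value. On ties, the value from the earliest token is kept.
    """
    if not tokens:
        raise ValueError("need at least one token to merge")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all tokens must have the same dimensionality")
    # max(..., key=abs) returns the original signed value, not its magnitude.
    return [max((t[d] for t in tokens), key=abs) for d in range(dim)]


if __name__ == "__main__":
    a = [4.0, -1.0, 0.5]
    b = [-2.0, -3.0, 0.5]
    print(max_magnitude_merge([a, b]))  # [4.0, -3.0, 0.5]
```

For comparison, averaging the same two tokens would give `[1.0, -2.0, 0.5]`, which shrinks the strong activations in the first two dimensions. Under this reading, keeping the larger-magnitude entry is what "preserves salient features".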

Paper Structure

This paper contains 20 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: (a$\rightarrow$b): Most token merging methods, like ToMe shown here, fail to preserve spatial layouts. (a$\rightarrow$c): Expedite preserves spatial structure, but fails to exploit information density unevenness across regions, losing information. (a$\rightarrow$d): CubistMerge preserves spatial coherence while focusing token reduction on regions with low information density.
  • Figure 2: Attention patterns with relative positional bias towards 5 different token positions (indicated by red stars) on SAM-B. (a) shows the attention map of the baseline model. (b) shows the effective attention pattern with ToMe applied. (c) shows the effective attention pattern with our method applied. Our method preserves attention patterns better than non-spatial-preserving methods like ToMe.
  • Figure 3: 2D token reduction with spatial-aware merging: (1) original 14×14 tokens, (2) select horizontal tokens to merge, (3) merge horizontally to 14×12 tokens, (4) select vertical tokens to merge, (5) merge vertically to 12×12 tokens.
  • Figure 4: Illustration of edge selection algorithms. Arrows on selected edges (in orange) indicate the direction of token merging, pointing from the source token to the destination token. The numbers on edges represent the execution order of merging required by dependencies. (a) Path graph with naive edge selection, requiring sequential execution. (b) Path graph with naive edge selection, optimized with a reduction tree to $O(\log N)$ complexity. (c) Path graph with bipartite edge selection, which eliminates dependencies by ensuring each source token (in red) can only merge into one destination (in blue).
  • Figure 5: Image classification results on spatial architectures, varying $r_h = r_w = 1, 2, 3$ with $l=10$ on MViTv2-B, $l=20$ on DINOv3-ViT7B and MViTv2-L.
  • ...and 8 more figures
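
The captions of Figures 3 and 4 can be read together as a two-pass procedure: remove the same number of tokens from every row, then the same number from every column, with each source token merging only into an adjacent destination. The following is a minimal sketch under several assumptions that the captions do not confirm: odd positions act as sources and even positions as destinations, similarity is cosine similarity, merged values use a signed max-magnitude rule, and $r_w$ / $r_h$ are taken to be the tokens removed per row / per column. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Token = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_pair(dst: Token, src: Token) -> Token:
    """Assumed merge rule: keep the signed larger-magnitude value per dimension."""
    return [max(x, y, key=abs) for x, y in zip(dst, src)]


def reduce_line(line: List[Token], r: int) -> List[Token]:
    """Remove exactly r tokens from one row or column, preserving order.

    Assumed bipartite roles: odd indices are sources, even indices are
    destinations. A source merges only into an adjacent destination, so no
    merge depends on the result of another merge.
    """
    candidates = []  # (similarity, source index, destination index)
    for s in range(1, len(line), 2):
        neighbours = [d for d in (s - 1, s + 1) if d < len(line)]
        best = max(neighbours, key=lambda d: cosine(line[s], line[d]))
        candidates.append((cosine(line[s], line[best]), s, best))
    if not 0 <= r <= len(candidates):
        raise ValueError("r must be between 0 and the number of source tokens")
    # Merge the r most similar (source, destination) pairs.
    chosen = sorted(candidates, key=lambda c: c[0], reverse=True)[:r]
    out = [list(t) for t in line]
    removed = set()
    for _, s, d in chosen:
        out[d] = merge_pair(out[d], line[s])
        removed.add(s)
    return [t for i, t in enumerate(out) if i not in removed]


def reduce_grid(grid: List[List[Token]], r_h: int, r_w: int) -> List[List[Token]]:
    """Shrink an H x W token grid to (H - r_h) x (W - r_w)."""
    # Horizontal pass: every row loses r_w tokens, so the grid stays rectangular.
    rows = [reduce_line(row, r_w) for row in grid]
    # Vertical pass: transpose, reduce each column by r_h, transpose back.
    cols = [reduce_line(list(col), r_h) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    # Synthetic 14 x 14 grid of 4-dimensional tokens, as in Figure 3.
    grid = [[[math.sin(i + 0.3 * j + k) for k in range(4)]
             for j in range(14)] for i in range(14)]
    out = reduce_grid(grid, r_h=2, r_w=2)
    print(len(out), len(out[0]))  # 12 12
```

Because every row drops the same number of tokens and surviving tokens keep their relative order, the output is still a dense 2D grid that window attention or relative positional embeddings can index. Which tokens are dropped may differ from row to row, which is how the sketch concentrates merging where neighbors are most similar.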