StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

Duy M. H. Nguyen, Tuan A. Tran, Duong Nguyen, Siwei Xie, Trung Q. Nguyen, Mai T. N. Truong, Daniel Palenicek, An T. Le, Michael Barz, TrungTin Nguyen, Tuan Dam, Ngan Le, Minh Vu, Khoa Doan, Vien Ngo, Pengtao Xie, James Zou, Daniel Sonntag, Jan Peters, Mathias Niepert

TL;DR

A resolution-preserving merge-unmerge framework tailored to SAM, which computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery.

Abstract

Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
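
The merge-unmerge pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the names `token_energy` and `merge_unmerge`, the forward-difference energy, the fixed cell size `cell`, the threshold `tau`, the `protect` mask, and the use of the cell mean as the single representative are all assumptions made for illustration.

```python
import numpy as np

def token_energy(feats):
    """Per-token energy from first-order finite differences (assumed form).

    feats: (H, W, C) grid of token features. Returns an (H, W) array.
    """
    gy = np.zeros_like(feats)
    gx = np.zeros_like(feats)
    gy[:-1] = feats[1:] - feats[:-1]            # forward difference along rows
    gx[:, :-1] = feats[:, 1:] - feats[:, :-1]   # forward difference along columns
    return np.sqrt((gx ** 2 + gy ** 2).sum(axis=-1))

def merge_unmerge(feats, cell=2, tau=0.1, protect=None):
    """Merge flat, unprotected cells to one token each; return an unmerge function."""
    H, W, C = feats.shape
    assert H % cell == 0 and W % cell == 0, "grid must tile into cells"
    energy = token_energy(feats)
    if protect is None:
        protect = np.zeros((H, W), dtype=bool)  # e.g. True near prompt points

    kept, recipe = [], []  # reduced tokens and how to scatter them back
    for i in range(0, H, cell):
        for j in range(0, W, cell):
            e = energy[i:i + cell, j:j + cell]
            is_flat = e.max() < tau and not protect[i:i + cell, j:j + cell].any()
            if is_flat:
                # One representative per mergeable cell (here: the cell mean).
                block = feats[i:i + cell, j:j + cell].reshape(-1, C)
                kept.append(block.mean(axis=0))
                recipe.append(("cell", i, j, len(kept) - 1))
            else:
                # Boundary or protected cell: keep every token at full resolution.
                for di in range(cell):
                    for dj in range(cell):
                        kept.append(feats[i + di, j + dj])
                        recipe.append(("token", i + di, j + dj, len(kept) - 1))
    reduced = np.stack(kept)  # (M, C) with M <= H * W

    def unmerge(x):
        """Scatter (M, C') tokens back to a dense (H, W, C') grid."""
        out = np.empty((H, W, x.shape[-1]), dtype=x.dtype)
        for kind, i, j, k in recipe:
            if kind == "cell":
                out[i:i + cell, j:j + cell] = x[k]  # copy representative to the cell
            else:
                out[i, j] = x[k]
        return out

    return reduced, unmerge
```

In this sketch the encoder block would run on `reduced`, and `unmerge` would restore the dense grid that SAM's mask decoder expects. On a 4x4 grid whose left half is 0 and right half is 1, the two cells touching the edge keep all their tokens while the two flat cells collapse to one token each, so 16 tokens become 10.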

Paper Structure

This paper contains 39 sections, 7 theorems, 35 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Fix an encoder layer $\ell$ with windows $\{\mathcal{P}_k\}_{k=1}^{K_\ell}$. Let $\text{SD}_{\ell}(\mathrm{SG})$ denote the spectral discrepancy induced by StructSAM's score-guided merging, and $\text{SD}_{\ell}(\mathrm{Base})$ that of a non-score-guided baseline (e.g., random or stride-based desti

Figures (8)

  • Figure 1: SAM's encoder with global and windowed attention.
  • Figure 2: StructSAM overview. Feature-gradient energy identifies structurally important regions, forming a protected set that is kept at full resolution. Visually flat regions are selectively merged (one representative per mergeable cell) and followed by lightweight token recovery (unmerging), so SAM’s mask decoder still receives a dense feature grid.
  • Figure 3: Illustration of token merging strategies. ToMe and ToMeSD treat all tokens as mergeable, while PiToMe introduces a protected set that is effective only at low merge rates. In contrast, our method preserves structurally important tokens while supporting aggressive token reduction.
  • Figure 4: Illustration of segmentation quality at 45% merging rate. Our method preserves fine structural details and achieves higher segmentation quality than baselines, while other approaches often miss thin structures or incorrectly merge object regions into the background.
  • Figure 5: Token compression results on INbreast dataset using MedSAM
  • ...and 3 more figures

Theorems & Definitions (15)

  • Theorem 1: Informal: Layerwise spectrum stability of score-guided merging
  • Definition 1: Graph Coarsening
  • Definition 2: Graph Lifting
  • Lemma 1: Eigenvalue inclusion under lifting
  • Proof: Proof of Lemma 1
  • Theorem 2: Formal version of Theorem 1
  • Proposition 1: Row/column drift under correct vs incorrect merges
  • Proof: Proof of Proposition 1
  • Proposition 2: Adjacency-to-Laplacian perturbation
  • Proof: Proof of Proposition 2
  • ...and 5 more
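
The graph coarsening and lifting definitions and the eigenvalue-inclusion lemma listed above can be made concrete with a small NumPy sketch. The paper's exact definitions are not reproduced in this listing, so the sketch assumes one common formulation: coarsening sums edge weights between groups of merged tokens, lifting spreads each coarse weight uniformly back over the original nodes, and spectral discrepancy is the L1 distance between normalized-Laplacian spectra. The function names and the toy similarity matrix are hypothetical.

```python
import numpy as np

def normalized_laplacian(W):
    """I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix with positive degrees."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def coarsen(W, assign, n):
    """Sum weights between groups; assign[u] is the group index of node u."""
    N = len(W)
    C = np.zeros((n, N))
    C[assign, np.arange(N)] = 1.0   # group-membership matrix
    return C @ W @ C.T, C           # coarse weights (with self-loops) and membership

def lift(Wc, C):
    """Spread each coarse weight uniformly over the node pairs it came from."""
    sizes = C.sum(axis=1)
    return C.T @ (Wc / np.outer(sizes, sizes)) @ C

def spectral_distance(W, assign, n):
    """L1 distance between the original and lifted Laplacian spectra (assumed metric)."""
    Wc, C = coarsen(W, assign, n)
    lam = np.linalg.eigvalsh(normalized_laplacian(W))
    lam_l = np.linalg.eigvalsh(normalized_laplacian(lift(Wc, C)))
    return np.abs(lam - lam_l).sum()

# Toy similarity graph: tokens {0, 1} are alike, tokens {2, 3} are alike.
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])

sd_good = spectral_distance(W, [0, 0, 1, 1], 2)  # merge similar tokens
sd_bad = spectral_distance(W, [0, 1, 0, 1], 2)   # merge dissimilar tokens
print(sd_good < sd_bad)  # True
```

Under these assumptions, the lifted Laplacian's spectrum consists of the coarse Laplacian's eigenvalues plus the eigenvalue 1 repeated once per removed node, which is the sense in which lifting lets the coarse and original graphs be compared at the same size. The toy example only illustrates the direction of the layerwise stability claim (merging similar tokens distorts the spectrum less than merging dissimilar ones); it is not a substitute for the paper's proofs.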