SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, Carlo Masone

TL;DR

SANSA reveals that SAM2’s embeddings contain latent semantic information entangled with tracking cues. By inserting lightweight AdaptFormer adapters into the frozen SAM2 and training with a sequential, memory-based objective, SANSA shifts matching from purely visual similarity to semantic similarity, enabling robust few-shot segmentation across unseen classes while preserving SAM2’s promptable segmentation and video object segmentation capabilities. The approach achieves state-of-the-art results on strict few-shot benchmarks and strong generalization in in-context settings, with substantial speedups and a compact parameter footprint. This demonstrates that foundation-model representations can be restructured to expose their latent semantics without full model fine-tuning, offering practical benefits for fast annotation and cross-domain segmentation.

Abstract

Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher-level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible prompting via points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.

Paper Structure

This paper contains 37 sections, 4 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: We evaluate frozen SAM2 on few-shot segmentation tasks on four datasets with varying degrees of semantic variability. On datasets with low semantic shift and high intra-class visual similarity [candemir2013lung, codella2019skin], SAM2 matches or even outperforms the state-of-the-art APSeg [he2024apseg]. However, on more challenging datasets like COCO and LVIS, with high semantic shift (e.g., cruise ship vs. rowboat), its performance drops significantly compared with GF-SAM [zhang2024bridge]. The bottom row illustrates examples of ground-truth masks in both scenarios.
  • Figure 2: We extract SAM2 features from object instances across diverse images and visualize their distribution using the first three principal components of PCA, mapped to RGB channels. The features appear entangled, with clusters mixing across categories, highlighting the lack of a coherent semantic structure in the original feature space. After adapting the feature space with SANSA, well-defined clusters emerge: semantically similar instances group together, forming coherent structures despite intra-class variation in visual appearance.
  • Figure 3: Overview of SANSA: Given $k$ annotated reference images and a target image, we construct a pseudo-video by concatenating them, then leverage SAM2’s streaming pipeline to process the reference frames together with their annotations sequentially. We restructure SAM2’s feature space to make its latent semantic structure explicit, enabling mask propagation from reference to target based on semantic similarity. The emergent semantic structure is visualized by the 3D PCA projection of $\mathcal{F}$.
  • Figure 4: Semantic structure of feature space. (a) PCA visualization of frozen SAM2 and our SANSA features on a COCO fold with unseen classes, showing the first two principal components color-coded by class. SAM2 features exhibit weak semantic separability, indicating entanglement with other signals. (b) PCA-based RGB visualization of SANSA features across images with seen and unseen categories, showing consistent semantic mapping. (c) Part-level semantics and cross-image consistency. We cluster features per object and match clusters across image pairs via Hungarian Matching. This reveals that SANSA captures fine-grained distinctions (e.g., handlebar vs. wheel), spatial layout (e.g., upper vs. lower wheel), and produces representations that align across images.
  • Figure 5: Comparison of inference speed and mIoU, with bubble size representing #parameters. The plot highlights our superior trade-off.
  • ...and 7 more figures
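
The caption of Figure 4(c) mentions matching per-object feature clusters across image pairs via Hungarian matching. The sketch below illustrates that idea only and is not the authors' code. It assumes each object has already been reduced to a small list of cluster centroids, scores centroid pairs by cosine similarity (an assumption; the caption does not state the cost), and finds the best one-to-one assignment by exhaustive search over permutations. Exhaustive search is a stand-in that returns the same optimum as the Hungarian algorithm but is only practical for a handful of clusters. In practice `scipy.optimize.linear_sum_assignment` would be the usual tool. The names `cosine_similarity` and `match_clusters` and the toy centroids are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_clusters(centroids_a, centroids_b):
    """Find the one-to-one cluster assignment that maximizes summed cosine similarity.

    Brute-force stand-in for Hungarian matching: it tries every permutation,
    so it reaches the same optimum but costs O(k!) and suits only small k.
    Returns (pairs, total), where pairs[i] = (index in A, matched index in B).
    """
    if len(centroids_a) != len(centroids_b):
        raise ValueError("this sketch assumes the same number of clusters per object")
    k = len(centroids_a)
    # Pairwise similarity matrix: sim[i][j] compares cluster i of A with cluster j of B.
    sim = [[cosine_similarity(a, b) for b in centroids_b] for a in centroids_a]
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(k)):
        total = sum(sim[i][perm[i]] for i in range(k))
        if total > best_total:
            best_perm, best_total = perm, total
    return [(i, best_perm[i]) for i in range(k)], best_total


# Hypothetical 3-D centroids for three parts of the same object class in two images.
image_a = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]
image_b = [[0.0, 0.9, 0.1], [0.2, 0.0, 1.0], [1.0, 0.1, 0.0]]

pairs, total = match_clusters(image_a, image_b)
print(pairs)  # [(0, 2), (1, 0), (2, 1)]
```

Each pair links a cluster in the first image to its counterpart in the second. Colouring matched clusters identically is one way to produce a cross-image consistency view like the one the caption describes.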