
Taming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation

Gensheng Pei, Xiruo Jiang, Yazhou Yao, Xiangbo Shu, Fumin Shen, Byeungwoo Jeon

TL;DR

This work tackles prompt-induced failures of open-vocabulary segmentation under distribution drift by introducing ConceptBank, a parameter-free calibration framework that builds a dataset-specific concept bank from target support data. ConceptBank operates in three stages—prototype anchoring, representative support mining, and prototype-consistent concept fusion—to produce target-calibrated embeddings for each class while keeping SAM3 frozen, enabling efficient, plug-in adaptation at inference time. Across natural-scene and remote-sensing benchmarks, ConceptBank yields robust gains over vanilla SAM3 and competitive baselines, achieving averages of 67.1 mIoU (natural-scene) and 52.1 mIoU (remote sensing) and confirming the value of data-centric prompt calibration for drift robustness. The approach offers a practical, gradient-free pathway to deploy open-vocabulary segmentation in varied domains, with potential to generalize to other multi-modal foundation models.

Abstract

The recent introduction of SAM3 has revolutionized Open-Vocabulary Segmentation (OVS) through promptable concept segmentation, which grounds pixel predictions in flexible concept prompts. However, this reliance on pre-defined concepts makes the model vulnerable: when visual distributions shift (data drift) or conditional label distributions evolve (concept drift) in the target domain, the alignment between visual evidence and prompts breaks down. In this work, we present ConceptBank, a parameter-free calibration framework to restore this alignment on the fly. Instead of adhering to static prompts, we construct a dataset-specific concept bank from the target statistics. Our approach (i) anchors target-domain evidence via class-wise visual prototypes, (ii) mines representative supports to suppress outliers under data drift, and (iii) fuses candidate concepts to rectify concept drift. We demonstrate that ConceptBank effectively adapts SAM3 to distribution drifts, including challenging natural-scene and remote-sensing scenarios, establishing a new baseline for robustness and efficiency in OVS. Code and model are available at https://github.com/pgsmall/ConceptBank.

Paper Structure (25 sections, 9 equations, 16 figures, 12 tables)

This paper contains 25 sections, 9 equations, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Data drift and concept drift in SAM3. We contrast the segmentation outputs of vanilla SAM3 and our ConceptBank under two failure patterns: (a) data drift (shifted visual statistics) and (b) concept drift (shifted label semantics). ConceptBank restores prompt-mask alignment by replacing static source-induced prompts with target-calibrated anchors from a Concept Bank.
  • Figure 2: Illustration of the proposed ConceptBank framework. Phase 1 (Construction) on the target support set $\mathcal{D}$: Stage I estimates class prototypes using the frozen $\phi_V$ (masked pooling + class averaging); Stage II mines representative supports $\mathcal{R}_c$ via cosine Top-$K$; Stage III fuses LLM-expanded texts with visual guidance to produce target-calibrated embeddings $e_c^*$, forming the Concept Bank $\mathcal{B}$. Phase 2 (Inference) on the target test set: plug $\mathcal{B}$ into the frozen SAM3 for test segmentation, bypassing the text encoder $\phi_T$.
  • Figure 3: Qualitative comparison results of open-vocabulary segmentation on natural-scene datasets. Zoom in for best view.
  • Figure 4: Qualitative comparison results of open-vocabulary segmentation on remote-sensing datasets. Zoom in for best view.
  • Figure 5: Qualitative comparison results of open-vocabulary segmentation on Pascal VOC21. Zoom in for best view.
  • ...and 11 more figures
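
The first two construction stages named in the Figure 2 caption (masked pooling with class averaging, then cosine Top-K mining) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, array shapes, and normalization choices below are assumptions made for illustration, and the features stand in for whatever the frozen visual encoder produces. Stage III (concept fusion) is omitted because its fusion rule is not specified in this summary.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (near) unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_pool(feature_map, mask):
    """Average an (H, W, D) feature map over the pixels where the (H, W) mask is on.

    Hypothetical helper: stands in for the "masked pooling" step of Stage I.
    """
    weights = mask.astype(feature_map.dtype)
    total = (feature_map * weights[..., None]).sum(axis=(0, 1))
    return total / max(weights.sum(), 1.0)  # guard against an empty mask


def class_prototype(support_embeddings):
    """Stage I (sketch): average the normalized (N, D) support embeddings of one class."""
    return l2_normalize(l2_normalize(support_embeddings).mean(axis=0))


def mine_representatives(support_embeddings, prototype, k):
    """Stage II (sketch): indices of the k supports most cosine-similar to the prototype."""
    sims = l2_normalize(support_embeddings) @ prototype
    order = np.argsort(-sims)  # descending similarity
    return order[:k], sims


# Toy data (assumed): three consistent supports and one outlier for a single class.
supports = np.array([[1.0, 0.1], [1.0, -0.1], [1.0, 0.0], [-1.0, 0.2]])
prototype = class_prototype(supports)
top_idx, sims = mine_representatives(supports, prototype, k=3)
```

In this toy case the fourth support points away from the others, so it receives the lowest cosine similarity and is excluded from the Top-3 set, which is the outlier-suppression behavior the caption attributes to Stage II.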