BALD-SAM: Disagreement-based Active Prompting in Interactive Segmentation

Prithwijit Chowdhury, Mohit Prabhushankar, Ghassan AlRegib

TL;DR

This work establishes active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction.

Abstract

The Segment Anything Model (SAM) has revolutionized interactive segmentation through spatial prompting. While existing work primarily focuses on automating prompts in various settings, real-world annotation workflows involve iterative refinement, where annotators observe model outputs and strategically place prompts to resolve ambiguities. Current pipelines typically rely on the annotator's visual assessment of predicted mask quality. We postulate that a principled approach to automated interactive prompting is to use a model-derived criterion to identify the most informative region for the next prompt. In this work, we establish active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction. We further present BALD-SAM, a principled framework that adapts Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection by quantifying epistemic uncertainty. To do so, we freeze all pretrained SAM components and apply Bayesian uncertainty modeling only to a small, learned prediction head, making otherwise intractable uncertainty estimation practical for large, multi-million-parameter foundation models. Across 16 datasets spanning natural, medical, underwater, and seismic domains, BALD-SAM demonstrates strong cross-domain performance, ranking first or second on 14 of 16 benchmarks. We validate these gains through a comprehensive ablation suite covering 3 SAM backbones and 35 Laplace posterior configurations, amounting to 38 distinct ablation settings. Beyond strong average performance, BALD-SAM surpasses human prompting and, in several categories, even oracle prompting, while consistently outperforming one-shot baselines in final segmentation quality, particularly on thin and structurally complex objects.
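The recipe in the abstract (frozen features, a Bayesian treatment of only a small head, a BALD disagreement map, and an argmax query) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the head is assumed to be a per-pixel logistic layer on frozen features, the Laplace posterior is assumed diagonal with generalized Gauss-Newton curvature (only one of many possible posterior configurations), and every helper name (`laplace_head_samples`, `mask_prob_samples`, `bald_map`, `select_next_prompt`, `add_user_label`) is hypothetical. The per-pixel score is the standard BALD mutual information, $\mathrm{BALD}(x) = \mathbb{H}\big[\mathbb{E}_{\theta}\, p(y \mid x, \theta)\big] - \mathbb{E}_{\theta}\, \mathbb{H}\big[p(y \mid x, \theta)\big]$, estimated from posterior samples.

```python
import numpy as np


def sigmoid(x):
    # Numerically stable logistic function (no overflow for large |x|).
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def laplace_head_samples(train_feats, w_map, prior_prec, num_samples, rng):
    """Sample weights of a logistic head from a diagonal Laplace posterior (assumption).

    train_feats: (N, D) frozen features the head was trained on (hypothetical stand-in).
    w_map:       (D,) MAP weights of the small prediction head.
    """
    p = sigmoid(train_feats @ w_map)
    # Diagonal generalized Gauss-Newton curvature plus an isotropic Gaussian prior precision.
    precision = prior_prec + ((p * (1.0 - p))[:, None] * train_feats**2).sum(axis=0)
    std = 1.0 / np.sqrt(precision)
    return w_map + std * rng.standard_normal((num_samples, w_map.size))


def mask_prob_samples(image_feats, w_samples):
    # image_feats: (H, W, D) frozen per-pixel features -> (S, H, W) foreground probabilities.
    return sigmoid(np.einsum("hwd,sd->shw", image_feats, w_samples))


def binary_entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def bald_map(prob_samples):
    # Entropy of the mean prediction minus the mean entropy of each posterior sample.
    total = binary_entropy(prob_samples.mean(axis=0))
    expected = binary_entropy(prob_samples).mean(axis=0)
    return total - expected


def select_next_prompt(scores, prompts):
    # Highest-BALD pixel, skipping locations that already carry a prompt.
    masked = np.array(scores, dtype=float)
    for r, c, _ in prompts:
        masked[r, c] = -np.inf
    r, c = np.unravel_index(np.argmax(masked), masked.shape)
    return int(r), int(c)


def add_user_label(prompts, gt_mask, loc):
    # Simulated user: positive (1) inside the target mask, negative (0) outside.
    r, c = loc
    return prompts + [(r, c, int(gt_mask[r, c]))]


# Toy usage with random stand-in features (no real SAM features are involved).
rng = np.random.default_rng(0)
D = 8
w_map = rng.standard_normal(D)
train_feats = rng.standard_normal((500, D))
image_feats = rng.standard_normal((32, 32, D))
gt_mask = image_feats[..., 0] > 0

w_samples = laplace_head_samples(train_feats, w_map, prior_prec=1.0, num_samples=16, rng=rng)
scores = bald_map(mask_prob_samples(image_feats, w_samples))
prompts = add_user_label([], gt_mask, select_next_prompt(scores, []))
```

Pixels where posterior samples agree score near zero even if each sample is individually uncertain (for example, all samples predicting 0.5), because BALD isolates disagreement between samples, which is the epistemic part of the uncertainty.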

Paper Structure

This paper contains 39 sections, 25 equations, 3 figures, 7 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Iterative prompt-based interactive segmentation using SAM. (a) In the interactive loop, SAM receives an input image and a set of user-provided point prompts (positive/inclusion and negative/exclusion) and returns a segmentation mask. A human expert compares the predicted mask against the desired target segmentation and provides additional corrective prompts, which are fed back to SAM in the next iteration. (b) Prompt accumulation and mask evolution across iterations: the left panels show the prompt set at iterations $t=0,1,2$, and the right panels show the corresponding SAM outputs, demonstrating error correction and progressive convergence to the desired object mask.
  • Figure 2: BALD-SAM active prompt sampling. At iteration $t$, the image $\mathcal{I}$ and current prompt set $\mathcal{S}_t$ are processed by frozen SAM components and a Bayesian head sampled from a Laplace posterior. Multiple posterior samples produce an ensemble of mask probability maps, from which we compute a disagreement (mutual-information) map. The location with the highest BALD score is queried next, the user returns its label, and the prompt set is updated.
  • Figure 3: Strategy comparison across datasets using $\Delta$IoU over iterative prompting. Each subplot corresponds to one dataset (arranged in a 4$\times$4 grid) and shows $\Delta$IoU versus interaction iteration for HUMAN, BALD-SAM (ours), ENTROPY, RANDOM, and ORACLE strategies, averaged across seeds for a 15-iteration run. To enable within-dataset comparison of trend dynamics, $\Delta$IoU values are min-max normalized separately for each data source. The grid spans diverse domains, including natural images, medical images, underwater images, and seismic images, highlighting the robustness and cross-domain consistency of BALD-SAM under a unified evaluation protocol.
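The seed averaging and per-dataset normalization described for Figure 3 can be illustrated with the short sketch below. Two points are assumptions, since the caption does not define them: $\Delta$IoU at iteration $t$ is taken as $\mathrm{IoU}_t - \mathrm{IoU}_0$, and the min-max range is shared across all strategies of one dataset so that their curves stay comparable, while each dataset is normalized independently. The function names and the toy IoU values are hypothetical and do not come from the paper.

```python
import numpy as np


def seed_averaged_delta_iou(iou_by_seed):
    """iou_by_seed: (num_seeds, T) IoU per interaction iteration.

    Assumed definition: ΔIoU_t = IoU_t - IoU_0, then averaged across seeds.
    """
    iou = np.asarray(iou_by_seed, dtype=float)
    return (iou - iou[:, :1]).mean(axis=0)


def minmax_per_dataset(curves_by_strategy):
    """Min-max normalize seed-averaged ΔIoU curves jointly over all strategies of one dataset."""
    deltas = {name: seed_averaged_delta_iou(c) for name, c in curves_by_strategy.items()}
    stacked = np.concatenate(list(deltas.values()))
    lo, hi = stacked.min(), stacked.max()
    span = hi - lo if hi > lo else 1.0  # avoid division by zero when all curves are flat
    return {name: (d - lo) / span for name, d in deltas.items()}


# Toy usage: made-up IoU values for two strategies, two seeds, three iterations.
normalized = minmax_per_dataset({
    "strategy_a": [[0.5, 0.6, 0.7], [0.5, 0.7, 0.9]],
    "strategy_b": [[0.4, 0.4, 0.5], [0.4, 0.4, 0.5]],
})
```

Sharing one range per dataset preserves the relative ordering and gaps between strategies within a subplot, which is what within-dataset comparison of trend dynamics requires; normalizing each strategy separately would instead stretch every curve to [0, 1] and hide those gaps.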