DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang

TL;DR

DC-SAM addresses in-context segmentation by adapting SAM and SAM2 through prompt-tuning to generate high-quality prompts from both foreground and background priors. The method introduces dual positive/negative prompts and a cycle-consistent cross-attention module, enabling robust one-shot segmentation for images and videos. It also introduces IC-VOS, the first in-context video segmentation benchmark, and demonstrates state-of-the-art results on COCO-20i, PASCAL-5i, and IC-VOS. The work highlights effective use of SAM features in prompt generation and provides a practical, scalable pathway for extending visual foundation models to in-context learning tasks.

Abstract

Given a single labeled example, in-context segmentation aims to segment the corresponding objects. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method, based on prompt-tuning, to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder in segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on the fused features and initial visual prompts. Next, we provide a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adapt our proposed dual consistency method to the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of an in-context segmentation benchmark in the video domain, we manually curate and construct the first such benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
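
The image results above are reported in mIoU. As a rough illustration only, the sketch below computes a class-averaged IoU by accumulating intersection and union areas per class over all evaluation episodes, which is a common convention in few-shot segmentation. The exact evaluation protocol is not stated in this excerpt, so the accumulation scheme, the function names `mask_overlap` and `mean_iou`, and the nested-list mask format are all assumptions made for this example.

```python
from collections import defaultdict


def mask_overlap(pred, target):
    """Return (intersection, union) pixel counts for two binary masks.

    Masks are nested lists of 0/1 values with identical shapes
    (an assumed format, chosen to keep the sketch dependency-free).
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    return inter, union


def mean_iou(episodes):
    """Class-averaged IoU in percent.

    `episodes` is an iterable of (class_id, pred_mask, target_mask).
    Intersections and unions are summed per class before dividing,
    then the per-class IoUs are averaged (assumed convention).
    """
    inter_sum = defaultdict(int)
    union_sum = defaultdict(int)
    for class_id, pred, target in episodes:
        inter, union = mask_overlap(pred, target)
        inter_sum[class_id] += inter
        union_sum[class_id] += union

    # Skip classes whose accumulated union is empty to avoid dividing by zero.
    class_ious = [inter_sum[c] / union_sum[c] for c in union_sum if union_sum[c] > 0]
    if not class_ious:
        raise ValueError("no class has a non-empty union")
    return 100.0 * sum(class_ious) / len(class_ious)
```

For example, one episode with IoU 0.5 for one class and one perfect episode for another class average to an mIoU of 75.0 under this convention.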

Paper Structure

This paper contains 20 sections, 15 equations, 14 figures, 14 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of the proposed DC-SAM method and IC-VOS benchmark. a) Comparison of the previous few-shot segmentation methods in 1), existing methods based on SAM/SAM2 in 2), and DC-SAM in 3). DC-SAM leverages multi-source features and generates positive and negative prompts by ensuring prompt consistency, integrating with SAM/SAM2 to achieve in-context segmentation for both images and videos; b) Visualization of image and video settings by DC-SAM; c) Quantitative comparison of DC-SAM with state-of-the-art approaches in terms of mIoU on COCO-20$^i$ and PASCAL-5$^i$, $\mathcal{J\&F}$ on the IC-VOS benchmark.
  • Figure 2: Overview of our constructed IC-VOS benchmark. a) Distribution of video sources and their proportions. b) Word cloud of expressions. c) Categories in the dataset with the number of clips and frames for each category. d) Example cases illustrating the support image, support mask, and query video.
  • Figure 3: Overview of the proposed DC-SAM framework. We use positive and negative branches to generate respective prompts, thereby refining the scope of the final generated mask. Additionally, we incorporate SAM features during the prompt generation process to better capture the characteristics of SAM, resulting in more accurate prompt boundaries. During the prompt generation process, we introduce cyclic consistent cross-attention to filter out non-cycle-consistent feature points, enhancing the precision of the prompts.
  • Figure 4: Illustration of our proposed cyclic consistent cross-attention mechanism. This figure shows the version applied to query features with one head. The "Cyc" operation represents the process described in Equation \ref{eq:cyc}, which ultimately generates a bias to filter out features that are not cycle-consistent.
  • Figure 5: Comparison of SAM segmentation results with and without negative prompts. (a) Segmentation of the cage using only positive prompts. (b) Segmentation of the cage using both positive and negative prompts. Although the result in (b) is still not optimal, adding negative prompts better separates the background, the dinosaur, and the cage, yielding a significantly improved segmentation.
  • ...and 9 more figures
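
The Figure 4 caption describes a "Cyc" operation that produces a bias to filter out features that are not cycle-consistent. The equation itself is not included in this excerpt, so the following is only a minimal sketch of one common way such a check can be formulated, not the paper's definition: start at a support position, jump to its most similar query position, jump back to that query position's most similar support position, and keep the starting position only if both support positions carry the same mask label. The function name `cycle_consistency_bias`, the dot-product affinity, and the 0 / negative-infinity bias values are assumptions for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def argmax(values):
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def cycle_consistency_bias(query_feats, support_feats, support_mask):
    """Return one additive attention bias per support position.

    0.0 keeps the position; -inf suppresses it once the bias is added
    to attention logits before a softmax (assumed usage).
    """
    # affinity[i][j]: similarity of query position i and support position j
    affinity = [[dot(q, s) for s in support_feats] for q in query_feats]

    bias = []
    for j in range(len(support_feats)):
        # Step 1: support position j -> most similar query position
        i_star = argmax([affinity[i][j] for i in range(len(query_feats))])
        # Step 2: that query position -> most similar support position
        j_star = argmax(affinity[i_star])
        # Cycle-consistent if the start and end share the same mask label
        consistent = support_mask[j] == support_mask[j_star]
        bias.append(0.0 if consistent else -math.inf)
    return bias
```

In this toy formulation, a background support position that looks almost like the foreground gets routed back to a foreground position and is therefore masked out, which matches the filtering role the caption attributes to the bias.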