DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Mengshi Qi; Pengfei Zhu; Xiangtai Li; Xiaoyang Bi; Lu Qi; Huadong Ma; Ming-Hsuan Yang

TL;DR

DC-SAM addresses in-context segmentation by adapting SAM and SAM2 through prompt-tuning to generate high-quality prompts from both foreground and background priors. The method introduces dual positive/negative prompts and a cycle-consistent cross-attention module, enabling robust one-shot segmentation for images and videos. It also introduces IC-VOS, the first in-context video segmentation benchmark, and demonstrates state-of-the-art results on COCO-20i, PASCAL-5i, and IC-VOS. The work highlights effective use of SAM features in prompt generation and provides a practical, scalable pathway for extending visual foundation models to in-context learning tasks.

Abstract

Given a single labeled example, in-context segmentation aims to segment the corresponding objects. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method, based on prompt-tuning, to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on the fused features and the initial visual prompts. Next, we provide a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to extend the proposed dual consistency method to mask tubes. Although DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of an in-context segmentation benchmark in the video domain, we manually curate and construct the first such benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
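
The abstract does not spell out how the cycle-consistent cross-attention is computed, so the following is only a minimal sketch of one generic way to express a cycle-consistency check between a labeled support example and a query. It is not the authors' implementation. The function name `cycle_consistent_mask`, the dot-product similarity, and the toy feature vectors are all assumptions made for illustration. The idea shown: a support position is kept for attention only if matching it to its most similar query position, and then matching back to the support, lands on a position with the same foreground/background label.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def argmax(values):
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def cycle_consistent_mask(support_feats, query_feats, support_labels):
    """Hypothetical helper: return one keep-flag per support position.

    support_feats:  list of feature vectors from the labeled example
    query_feats:    list of feature vectors from the image to segment
    support_labels: 1 for foreground, 0 for background, per support position
    """
    # sim[i][j] = similarity between query position i and support position j
    sim = [[dot(q, s) for s in support_feats] for q in query_feats]

    keep = []
    for j in range(len(support_feats)):
        # Forward step: best-matching query position for support position j.
        i_star = argmax([sim[i][j] for i in range(len(query_feats))])
        # Backward step: best-matching support position for that query position.
        j_star = argmax(sim[i_star])
        # Keep j only if the round trip preserves the foreground/background label.
        keep.append(support_labels[j] == support_labels[j_star])
    return keep


# Toy example (assumed values): three support positions, two query positions.
support = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, 0, 0]  # position 2 looks like foreground but is labeled background

mask = cycle_consistent_mask(support, query, labels)
print(mask)
```

In this toy input the third support position resembles the foreground feature but carries a background label, so its round trip ends on a position with a different label and it is filtered out. A real model would apply such a mask as a bias inside the attention computation rather than as a hard boolean filter, and would handle positive and negative prompts in separate branches as the abstract describes.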
