
VRP-SAM: SAM with Visual Reference Prompt

Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Xiaofan Li, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li

TL;DR

This work addresses SAM's limitation of relying on per-image geometric prompts by introducing VRP-SAM, which adds a Visual Reference Prompt encoder to transform annotated reference images into prompt embeddings that guide segmentation. The VRP encoder comprises a Feature Augmenter and a Prompt Generator, uses meta-learning to learn semantic-consistent representations, and supports point, scribble, box, and mask references. Empirical results on Pascal-5i and COCO-20i show state-of-the-art performance with only ~1.6M learnable parameters, strong generalization to unseen objects and cross-domain scenarios, and clear advantages over geometric prompts and text-guided approaches. The approach significantly reduces the need for object-specific prompts, enabling efficient, semantic-aware segmentation across large image collections, with code and models available publicly.

Abstract

In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to use annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM uses annotated reference images to understand specific objects and then segments those objects in a target image. Notably, the VRP encoder supports a variety of annotation formats for reference images, including point, box, scribble, and mask. VRP-SAM extends the versatility and applicability of the SAM framework while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to segment unseen objects and to perform cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM
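To make the four annotation formats concrete, the following is a minimal sketch of how point, box, scribble, and mask references could be brought into one common form (a binary mask) before being handed to an encoder. This is an illustrative assumption, not the authors' implementation: the abstract does not say how annotations are represented internally, and every name here (`points_to_mask`, `box_to_mask`, `scribble_to_mask`, `reference_to_mask`, the `radius` parameter) is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]   # (x, y) pixel coordinates
Mask = List[List[int]]    # rows of 0/1 values, indexed as mask[y][x]


def empty_mask(height: int, width: int) -> Mask:
    return [[0] * width for _ in range(height)]


def points_to_mask(points: Sequence[Point], height: int, width: int,
                   radius: int = 1) -> Mask:
    """Mark a square of side 2*radius+1 around each point (clipped to the image)."""
    mask = empty_mask(height, width)
    for px, py in points:
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                mask[y][x] = 1
    return mask


def box_to_mask(box: Tuple[int, int, int, int], height: int, width: int) -> Mask:
    """Fill the rectangle (x0, y0, x1, y1); both corners are inclusive."""
    x0, y0, x1, y1 = box
    mask = empty_mask(height, width)
    for y in range(max(0, y0), min(height, y1 + 1)):
        for x in range(max(0, x0), min(width, x1 + 1)):
            mask[y][x] = 1
    return mask


def scribble_to_mask(path: Sequence[Point], height: int, width: int,
                     radius: int = 0) -> Mask:
    """Treat a scribble as a polyline and mark pixels along each segment."""
    dense: List[Point] = list(path[:1])
    for (xa, ya), (xb, yb) in zip(path, path[1:]):
        steps = max(abs(xb - xa), abs(yb - ya), 1)
        for i in range(1, steps + 1):
            # Linear interpolation between consecutive scribble vertices.
            dense.append((xa + round(i * (xb - xa) / steps),
                          ya + round(i * (yb - ya) / steps)))
    return points_to_mask(dense, height, width, radius)


def reference_to_mask(kind: str, annotation, height: int, width: int) -> Mask:
    """Dispatch on the annotation format; a mask passes through unchanged."""
    if kind == "point":
        return points_to_mask(annotation, height, width)
    if kind == "box":
        return box_to_mask(annotation, height, width)
    if kind == "scribble":
        return scribble_to_mask(annotation, height, width)
    if kind == "mask":
        return annotation
    raise ValueError(f"unsupported annotation format: {kind}")
```

Under this assumption, a downstream encoder would see a single input type regardless of how the user annotated the reference image, which is one plausible way to support all four formats with one set of learnable parameters.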

Paper Structure (27 sections, 5 equations, 6 figures, 16 tables)


Figures (6)

  • Figure 1: Comparison of SAM's built-in Point and Box prompt modes with Visual Reference Prompt when handling numerous images. The prompts are all provided by users.
  • Figure 2: A comparison between SAM and VRP-SAM. VRP-SAM introduces a visual reference prompt encoder that accepts annotated reference images in point, scribble, box, and mask formats, offering a distinct enhancement over SAM.
  • Figure 3: Proposed VRP-SAM framework. Our approach enables SAM to perform visual reference segmentation by extending it with a VRP encoder. The encoder takes visual references of various granularities as input and encodes them into prompt embeddings. Our VRP encoder consists of a feature augmenter and a prompt generator.
  • Figure 4: The visualization results of VRP-SAM and GP-SAM.
  • Figure 5: Qualitative results of VRP-SAM across diverse image styles. The target images were sourced from the internet.
  • ...and 1 more figure