Personalize Segment Anything Model with One Shot

Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Xianzheng Ma, Hao Dong, Peng Gao, Hongsheng Li

TL;DR

This work tackles the problem of personalizing a general segmentation model (SAM) for user-specified concepts using only one-shot data. It introduces PerSAM, a training-free approach that leverages a location prior, target-guided attention, target-semantic prompting, and cascaded post-refinement, along with a fast scale-aware fine-tuning variant PerSAM-F that learns two parameters to resolve mask-scale ambiguities. The authors validate their methods on the new PerSeg dataset and across standard segmentation benchmarks, showing competitive or superior performance to existing one-shot methods, plus practical benefits for DreamBooth-based personalized image generation. Additionally, PerSAM-F achieves rapid adaptation (about 10 seconds on an A100) without retraining the backbone SAM. Code and resources are released to support adoption and further research.

Abstract

Driven by large-data pre-training, Segment Anything Model (SAM) has proven to be a powerful, promptable framework that is revolutionizing segmentation models. Despite this generality, customizing SAM for specific visual concepts without manual prompting remains underexplored, e.g., automatically segmenting your pet dog in different images. In this paper, we propose a training-free personalization approach for SAM, termed PerSAM. Given only a single image with a reference mask, PerSAM first localizes the target concept by a location prior, and then segments it within other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement. In this way, we effectively adapt SAM for private use without any training. To further alleviate mask ambiguity, we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the entire SAM, we introduce two learnable weights for multi-scale masks, training only 2 parameters within 10 seconds for improved performance. To demonstrate our efficacy, we construct a new segmentation dataset, PerSeg, for personalized evaluation, and test our methods on video object segmentation with competitive performance. Besides, our approach can also enhance DreamBooth to personalize Stable Diffusion for text-to-image generation by discarding background disturbance for better target appearance learning. Code is released at https://github.com/ZrrSkywalker/Personalize-SAM
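
The PerSAM-F idea in the abstract (two learnable weights over multi-scale masks, with SAM frozen) can be illustrated with a minimal sketch. Several things here are assumptions not stated in the abstract: that there are three candidate masks, that the third weight is fixed to 1 - w1 - w2, and that the weights are fit by plain gradient descent on a squared error against the one-shot reference mask. The function names `blend_masks`, `binarize`, and `fit_weights` are hypothetical, and the nested lists stand in for the tensors a real implementation would use.

```python
def blend_masks(masks, w1, w2):
    """Blend three candidate score maps with two free weights.

    Assumption: the third weight is 1 - w1 - w2, so only w1 and w2 are learned.
    """
    m1, m2, m3 = masks
    w3 = 1.0 - w1 - w2
    return [
        [w1 * a + w2 * b + w3 * c for a, b, c in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(m1, m2, m3)
    ]


def binarize(scores, threshold=0.5):
    """Turn a blended score map into a 0/1 mask."""
    return [[1 if v > threshold else 0 for v in row] for row in scores]


def fit_weights(masks, target, steps=200, lr=0.5):
    """Fit w1 and w2 on a single reference mask (toy gradient descent).

    Assumption: mean squared error as the loss; the step size suits the
    small example below and is not a recommended setting.
    """
    m1, m2, m3 = masks
    w1 = w2 = 1.0 / 3.0  # start from an even blend
    n = sum(len(row) for row in target)
    for _ in range(steps):
        pred = blend_masks(masks, w1, w2)
        g1 = g2 = 0.0
        for i, row in enumerate(target):
            for j, t in enumerate(row):
                err = pred[i][j] - t
                # d(pred)/d(w1) = m1 - m3 and d(pred)/d(w2) = m2 - m3
                g1 += 2.0 * err * (m1[i][j] - m3[i][j])
                g2 += 2.0 * err * (m2[i][j] - m3[i][j])
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    return w1, w2


# Toy candidates at three scales: a part, the object, and object plus surroundings.
part = [[1.0, 0.0], [0.0, 0.0]]
obj = [[1.0, 1.0], [0.0, 0.0]]
whole = [[1.0, 1.0], [1.0, 1.0]]

w1, w2 = fit_weights((part, obj, whole), target=obj)
mask = binarize(blend_masks((part, obj, whole), w1, w2))
```

In this toy case the reference mask coincides with the middle-scale candidate, so the fitted weights move toward selecting that scale. Only two scalars are updated; the candidate masks themselves are never changed, which mirrors the "frozen SAM" setting described above.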

Paper Structure

This paper contains 29 sections, 11 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Personalization of Segment Anything Model. We customize Segment Anything Model (SAM) (Kirillov et al., 2023) for specific visual concepts, e.g., your pet dog. With only one-shot data, we introduce two efficient solutions: a training-free PerSAM, and a fine-tuning PerSAM-F.
  • Figure 2: Personalized Segmentation Examples. Our PerSAM (left) can segment personal objects in any context with favorable performance, and PerSAM-F (right) further alleviates the ambiguity issue by scale-aware fine-tuning.
  • Figure 3: Improving DreamBooth (Ruiz et al., 2022) with PerSAM. By mitigating the disturbance of backgrounds during training, our approach can help to achieve higher-quality personalized text-to-image generation.
  • Figure 4: Positive-negative Location Prior. We calculate a location confidence map for the target object in a new test image based on the appearance of all its local parts. Then, we select the location prior as the point prompt for PerSAM.
  • Figure 5: Target-guided Attention (Left) & Target-semantic Prompting (Right). To inject target semantics into SAM, we explicitly guide the cross-attention layers and propose additional prompting with high-level cues.
  • ...and 10 more figures
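
The positive-negative location prior described in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single target feature vector instead of aggregating over all local parts, cosine similarity as the confidence measure, and the most and least similar locations as the positive and negative point prompts. The names `cosine` and `location_prior` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def location_prior(target_feature, feature_map):
    """Return (confidence_map, positive_point, negative_point).

    feature_map is an H x W grid of feature vectors from the test image.
    Assumption: the positive point is the most similar location and the
    negative point is the least similar one.
    """
    confidence = [
        [cosine(target_feature, feat) for feat in row] for row in feature_map
    ]
    cells = [
        (confidence[i][j], (i, j))
        for i in range(len(confidence))
        for j in range(len(confidence[i]))
    ]
    positive = max(cells, key=lambda cell: cell[0])[1]
    negative = min(cells, key=lambda cell: cell[0])[1]
    return confidence, positive, negative


# Toy 2 x 2 feature map with 2-dimensional features.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
confidence, positive, negative = location_prior([1.0, 0.0], features)
```

The two returned coordinates would then be passed to the segmentation model as a foreground and a background point prompt; how the prompt is encoded depends on the model interface and is left out here.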