SLiMe: Segment Like Me

Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh

TL;DR

SLiMe addresses the challenge of flexible-granularity image segmentation with minimal supervision by exploiting a pre-trained vision-language diffusion model, Stable Diffusion. It casts segmentation as a one-shot optimization that learns region-specific text embeddings guided by cross-attention and WAS-attention maps, enabling segmentation of unseen images at the granularity of the training annotation. Across standard benchmarks, SLiMe outperforms several few-shot and some supervised baselines and demonstrates robustness to occlusion and camouflage. A noted limitation is difficulty with very small targets, which stems from the limited resolution of the attention maps. Extending the approach to 3D and video data is suggested as future work.

Abstract

Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map", from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that each of them learns about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the attention maps, which in turn can be used to derive the segmentation map. This enables SLiMe to segment any real-world image during inference with the granularity of the segmented region in the training image, using just one example. Moreover, leveraging additional training data when available, i.e., few-shot, improves the performance of SLiMe. We carried out a knowledge-rich set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
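
The abstract's last step, deriving a segmentation map from the attention maps highlighted by the learned embeddings, can be illustrated with a minimal sketch. It assumes one attention map per learned embedding, all already at the same resolution, and labels each pixel with the embedding whose map responds most strongly there. The function name `segmentation_from_attention` and the toy values are hypothetical and are not taken from the paper.

```python
def segmentation_from_attention(maps):
    """Assign each pixel the index of the attention map with the largest value.

    maps: list of K attention maps, each an H x W list of lists of floats
          (assumption: one map per learned text embedding, same size).
    Returns an H x W list of lists of integer labels in [0, K).
    """
    height, width = len(maps[0]), len(maps[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            scores = [m[y][x] for m in maps]
            # Per-pixel argmax over the K maps (ties go to the lowest index).
            row.append(scores.index(max(scores)))
        labels.append(row)
    return labels


# Toy example: map 0 plays the role of background, map 1 of the target region.
background = [[0.9, 0.2], [0.8, 0.1]]
region = [[0.1, 0.8], [0.2, 0.9]]
print(segmentation_from_attention([background, region]))
```

In this toy case the right-hand column is assigned to the target region because its map dominates there.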

Paper Structure (17 sections, 8 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: SLiMe. Using just one user-annotated image with various granularity (as shown in the leftmost column), SLiMe learns to segment different unseen images in accordance with the same granularity (as depicted in the other columns).
  • Figure 2: Sample results of our proposed weighted accumulated self-attention (WAS-attention) maps. Naïvely employing cross-attention without self-attention for segmentation leads to inaccurate and noisy output (a and c). Combining the self-attention map with the cross-attention map to create the WAS-attention map improves the segmentation (b and d).
  • Figure 3: Optimization step. After extracting the image embeddings and adding noise, we pass them, along with a text embedding that is either obtained from a text encoder or initialized randomly, through the UNet to obtain cross- and WAS-attention maps. Two losses are then calculated using these maps and the ground truth mask. Additionally, SD's loss is computed by comparing the added noise with the UNet's predicted noise.
  • Figure 4: Attention-Extraction module. To extract the WAS-attention map of the $k^{th}$ text embedding with respect to an image, we follow three steps: (1) We feed the $k^{th}$ text embedding ($\mathcal{P}_k$) together with the noised embedding of the image ($\mathcal{I}_t$) to the UNet. We then calculate $A_{ca}^k$ by extracting the cross-attention maps of $\mathcal{P}_k$ from several layers, resizing them, and averaging them. (2) We extract the self-attention maps from several layers and average them ($A_{sa}$). (3) Finally, we flatten $A_{ca}^k$ to get $F_{ca}^k$, calculate a weighted sum of the channels of $A_{sa}$ with weights taken from $F_{ca}^k$, and call the result the "Weighted Accumulated Self-attention map" ($S_{\text{WAS}}^k$). The UNet also outputs the predicted noise, which is used to calculate the SD loss.
  • Figure 5: Segmentation results of camouflaged objects. The larger images are used for optimizing SLiMe, and as the source image for SegGPT. Notably, SLiMe outperforms SegGPT.
  • ...and 8 more figures
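
The three steps in the Figure 4 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-layer maps are assumed to be already resized to a common H x W size, $A_{sa}$ is assumed to be stored as H·W channels (one H x W map per source pixel, in row-major order), and the helper names `average_maps` and `was_attention` are hypothetical.

```python
def average_maps(maps):
    """Element-wise mean of same-sized H x W maps (steps 1 and 2, after resizing)."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[y][x] for m in maps) / len(maps) for x in range(width)]
        for y in range(height)
    ]


def was_attention(cross_attn, self_attn):
    """Step 3: weighted sum of self-attention channels.

    cross_attn: H x W cross-attention map A_ca^k for the k-th text embedding.
    self_attn:  list of H*W channels of A_sa, each an H x W map
                (assumed layout: channel p belongs to pixel p, row-major).
    Returns S_WAS^k, where S_WAS^k = sum over p of F_ca^k[p] * A_sa[p].
    """
    height, width = len(cross_attn), len(cross_attn[0])
    flat_ca = [value for row in cross_attn for value in row]  # F_ca^k
    if len(self_attn) != len(flat_ca):
        raise ValueError("expected one self-attention channel per pixel")
    was = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(flat_ca, self_attn):
        for y in range(height):
            for x in range(width):
                was[y][x] += weight * channel[y][x]
    return was


# Toy 1 x 2 "image": cross-attention from two layers, averaged into A_ca^k.
a_ca = average_maps([[[1.0, 0.0]], [[0.0, 1.0]]])
# Two self-attention channels, one per pixel.
a_sa = [[[0.75, 0.25]], [[0.5, 0.5]]]
print(was_attention(a_ca, a_sa))
```

Each pixel's self-attention channel is weighted by how strongly the text embedding attends to that pixel, so the result spreads the cross-attention response over regions that the self-attention treats as related.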