Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval

Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, Gal Chechik

TL;DR

PDM (Personalized Features Diffusion Matching) is a novel approach that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. It demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods.

Abstract

Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been applied to these tasks, showing results comparable to supervised methods. However, these models have a significant flaw: they struggle to locate a desired instance when other instances of the same class are present. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM (Personalized Features Diffusion Matching) that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks.

Paper Structure (17 sections, 8 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: The personalized segmentation task involves segmenting a specific reference object in a new scene. Our method can accurately identify the specific reference instance in the target image, even when other objects from the same class are present. While other methods capture visually or semantically similar objects, our method successfully extracts the identical instance by using a new personalized feature map and fusing semantic and appearance cues. Red and green indicate incorrect and correct segmentations, respectively.
  • Figure 1: Single block of a U-Net layer (Stable Diffusion).
  • Figure 2: (a) PCA visualization of $\mathcal{Q}^{SA}$ features obtained from the first self-attention block in the last layer of the U-Net module, at various diffusion timesteps. Objects with similar textures and colors have similar features. The dog's color in $I_1$ is similar to the colors of both the dog and the cat in $I_2$, indicating textural similarity. Additionally, the localization is sharper at larger timesteps. (b) Visualization of the cross-attention map $\mathcal{F}^S \mathcal{C}^T$ for a given prompt "dog". Note the higher region correlation (brighter colors) corresponding to the dog, while overlooking the cat in the bottom image.
  • Figure 2: Qualitative examples for personalized retrieval. DINOv2 exhibits improved instance-based characteristics compared to OpenCLIP. However, unlike other methods, which attend to color or texture, our method (PDM) leverages both semantic and appearance cues to successfully identify instances, even under substantial variations.
  • Figure 3: An overview of our Personalized Diffusion Features Matching approach. PDM combines semantic and appearance features for zero-shot personalized retrieval and segmentation. We first extract features from the reference image $I_r$ and the target image $I_t$. Appearance similarity is determined by the dot product of the cropped foreground features from the reference feature map, $\mathcal{F}_r^{AM}$, and the target feature map $\mathcal{F}^A_t$ (Eq. \ref{eq:appearance}). Semantic similarity is calculated as the product between the class name token $\mathcal{C}$ and the target semantic feature map $\mathcal{F}^S_t$ to create a Semantic Map (Eq. \ref{eq:semantic}). The final similarity map $S^{DF}$ combines both maps by average pooling. Note that while the appearance and semantic maps each attend to two dogs, their fusion yields a single, correct result.
  • ...and 3 more figures
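
The matching scheme described in the Figure 3 caption can be illustrated with a minimal NumPy sketch on toy features. This is not the authors' implementation: the function names, the cosine normalization, the max over reference foreground features, and the reading of "average pooling" as an elementwise mean of the two maps are all assumptions made for illustration, and the toy vectors stand in for real diffusion features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def appearance_map(ref_feats, ref_mask, tgt_feats):
    """Similarity of each target location to the reference foreground.

    ref_feats, tgt_feats: (H, W, D) arrays; ref_mask: (H, W) boolean mask.
    Assumption: reduce over reference features with a max.
    """
    h, w, d = tgt_feats.shape
    fg = l2_normalize(ref_feats[ref_mask])        # (N, D) cropped foreground
    tgt = l2_normalize(tgt_feats.reshape(-1, d))  # (H*W, D)
    sim = tgt @ fg.T                              # (H*W, N) dot products
    return sim.max(axis=1).reshape(h, w)


def semantic_map(class_token, tgt_sem_feats):
    """Dot product between a class-name token (D,) and target features (H, W, D)."""
    h, w, d = tgt_sem_feats.shape
    tgt = l2_normalize(tgt_sem_feats.reshape(-1, d))
    return (tgt @ l2_normalize(class_token)).reshape(h, w)


def fuse(app, sem):
    """Assumption: fuse the two maps with an elementwise mean."""
    return 0.5 * (app + sem)


def toy_demo():
    """Hypothetical 4x4 scene: reference dog, a second dog, and a look-alike cat."""
    background = np.array([0.0, 0.0, 1.0])

    # Reference image: a single foreground location with appearance [1, 0, 0].
    ref_feats = np.tile(background, (2, 2, 1))
    ref_feats[0, 0] = [1.0, 0.0, 0.0]
    ref_mask = np.zeros((2, 2), dtype=bool)
    ref_mask[0, 0] = True

    # Target appearance features.
    tgt_app = np.tile(background, (4, 4, 1))
    tgt_app[0, 0] = [1.0, 0.0, 0.0]  # the reference dog
    tgt_app[0, 3] = [1.0, 0.0, 0.0]  # a cat with similar color
    tgt_app[3, 3] = [0.0, 1.0, 0.0]  # a different-looking dog

    # Target semantic features and the "dog" class token.
    dog_token = np.array([1.0, 0.0, 0.0])
    tgt_sem = np.tile(background, (4, 4, 1))
    tgt_sem[0, 0] = [1.0, 0.0, 0.0]  # dog
    tgt_sem[3, 3] = [1.0, 0.0, 0.0]  # dog
    tgt_sem[0, 3] = [0.0, 1.0, 0.0]  # cat

    app = appearance_map(ref_feats, ref_mask, tgt_app)
    sem = semantic_map(dog_token, tgt_sem)
    return app, sem, fuse(app, sem)


if __name__ == "__main__":
    app, sem, fused = toy_demo()
    print("fused argmax:", np.unravel_index(fused.argmax(), fused.shape))
```

In this toy scene the appearance map alone scores two locations highly (the reference dog and the similarly colored cat), and the semantic map alone scores both dogs highly. Only the location of the reference dog is high in both, so it is the unique maximum after fusion.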