A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

TL;DR

This work investigates the internal representations of Stable Diffusion (SD) for semantic and dense image correspondence and reveals that SD features, while spatially coherent, can be semantically imprecise, in contrast with DINOv2’s sparse but accurate matches. By analyzing and combining SD with DINOv2 through a simple, normalization-based fusion, the authors demonstrate strong zero-shot performance gains on SPair-71k, PF-Pascal, and TSS, outperforming many prior unsupervised and some supervised methods. The proposed fusion strategy leverages SD’s spatial reliability and DINOv2’s semantic precision, yielding improved dense correspondences and enabling applications such as instance swapping with higher fidelity. The paper highlights the complementary nature of diffusion-based features and self-supervised Vision Transformer features, suggesting that straightforward feature fusion can meaningfully advance semantic and dense correspondence tasks.

Abstract

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.
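
The fusion and nearest-neighbor matching described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it assumes per-patch descriptors have already been extracted from both models as arrays of shape (num_patches, dim), and the function names (`l2_normalize`, `fuse_features`, `nearest_neighbor_matches`) and the weighting parameter `alpha` are hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(feats, eps=1e-8):
    """Scale each row (one descriptor per patch) to unit L2 norm."""
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    return feats / np.maximum(norms, eps)


def fuse_features(sd_feats, dino_feats, alpha=0.5):
    """Normalize each feature type, then concatenate along the channel axis.

    `alpha` is an assumed weight balancing the two feature types;
    the source text only states that the fusion is normalization-based.
    """
    sd = l2_normalize(sd_feats)
    dino = l2_normalize(dino_feats)
    return np.concatenate([alpha * sd, (1.0 - alpha) * dino], axis=-1)


def nearest_neighbor_matches(src_feats, tgt_feats):
    """For every source patch, return the index of the most similar target patch
    under cosine similarity."""
    src = l2_normalize(src_feats)
    tgt = l2_normalize(tgt_feats)
    similarity = src @ tgt.T  # shape: (num_src_patches, num_tgt_patches)
    return similarity.argmax(axis=1)


# Toy usage with random stand-ins for real SD and DINOv2 descriptors.
rng = np.random.default_rng(0)
sd_tgt = rng.normal(size=(16, 8))
dino_tgt = rng.normal(size=(16, 4))
perm = rng.permutation(16)  # the source patches are a shuffled copy of the target
fused_src = fuse_features(sd_tgt[perm], dino_tgt[perm])
fused_tgt = fuse_features(sd_tgt, dino_tgt)
matches = nearest_neighbor_matches(fused_src, fused_tgt)
```

Normalizing each feature type before concatenation keeps either descriptor from dominating the similarity purely because of its scale; `alpha` then controls the relative contribution of the two.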

Paper Structure (15 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Semantic correspondence with fused Stable Diffusion and DINO features. On the left, we show the accuracy of our correspondences and illustrate the instance swapping process. From top to bottom: Starting with pairs of images (source image in orange box), we fuse Stable Diffusion and DINO features to construct robust representations and build high-quality dense correspondence. This facilitates pixel-level instance swapping, and a subsequent stable-diffusion-based refinement process yields a plausible swapped instance. On the right, we demonstrate the robustness of our approach by matching dogs, horses, cows, and even motorcycles to the cat in the source image. Our approach is capable of building reasonable correspondence even when the paired instances exhibit significant differences in categories, shapes, and poses.
  • Figure 2: Analysis of features from different decoder layers in SD. Top: Visualization of PCA-computed features from early (layer 2), intermediate (layers 5 and 8) and final (layer 11) layers. The first three components of PCA, computed across a pair of segmented instances, serve as color channels. Early layers focus more on semantics, while later layers concentrate on textures. Bottom: K-Means clustering of these features. K-Means clusters are computed for each image individually, followed by an application of the Hungarian method to find the optimal match between clusters. The color in each column represents a pair of matched clusters.
  • Figure 3: Analysis of different features for correspondence. We present visualization of PCA for the inputs from DAVIS (Perazzi et al., 2016) (left) and dense correspondence for SPair-71k (Min et al., 2019) (right). The figures show the performance of SD and DINO features under different inputs: identical instance (top left), pure object masks (bottom left), challenging inputs requiring semantic understanding (top right) and spatial information (bottom right). Please refer to Supplemental B.1 for more results.
  • Figure 4: Semantic flow maps using different features. White mask indicates valid pixels and orange mask separates the background flow. SD features yield smoother flow fields versus DINOv2's isolated outliers.
  • Figure 5: Qualitative comparison of instance swapping with different features. SD features deliver smoother swapped results, DINOv2 reveals greater details, and the fused approach takes the strengths of both. Notably, the fused features generate more faithful results to the reference image, as highlighted by the preserved stripes on the cat instance in the top-right example. Please refer to Supplemental B.2 for more results.
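
The PCA visualization described in the Figure 2 caption (first three principal components, computed jointly across a pair of instances, used as color channels) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: it assumes each instance is given as an array of per-patch features of shape (num_patches, dim), and the function name `joint_pca_rgb` and the min-max scaling to [0, 1] are hypothetical choices made for this example.

```python
import numpy as np


def joint_pca_rgb(feats_a, feats_b, n_components=3):
    """Project two feature sets onto shared principal components and map them to colors.

    PCA is fit on the concatenation of both feature sets so that the same
    color means the same direction in feature space for both images.
    """
    stacked = np.concatenate([feats_a, feats_b], axis=0)
    centered = stacked - stacked.mean(axis=0, keepdims=True)

    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T

    # Assumed display convention: min-max scale each component jointly to [0, 1].
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    colors = (projected - lo) / np.maximum(hi - lo, 1e-8)

    n_a = feats_a.shape[0]
    return colors[:n_a], colors[n_a:]


# Toy usage with random stand-ins for the features of two segmented instances.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(20, 16))
feats_b = rng.normal(size=(30, 16))
rgb_a, rgb_b = joint_pca_rgb(feats_a, feats_b)
```

Reshaping `rgb_a` and `rgb_b` back to each image's patch grid would give one RGB value per patch. Fitting PCA on both instances together, rather than separately, is what makes the colors comparable across the pair.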