A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
TL;DR
This work investigates the internal representations of Stable Diffusion (SD) for semantic and dense image correspondence and finds that SD features, while spatially coherent, can be semantically imprecise, in contrast to DINOv2’s sparse but accurate matches. By combining SD and DINOv2 features through a simple, normalization-based fusion, the authors demonstrate strong zero-shot performance gains on SPair-71k, PF-Pascal, and TSS, outperforming many prior unsupervised methods and some supervised ones. The proposed fusion strategy leverages SD’s spatial reliability and DINOv2’s semantic precision, yielding improved dense correspondences and enabling applications such as instance swapping with higher fidelity. The paper highlights the complementary nature of diffusion-based features and self-supervised Vision Transformer features, suggesting that straightforward feature fusion can meaningfully advance semantic and dense correspondence tasks.
Abstract
Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similarly to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.
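
The fusion and nearest-neighbor matching described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that fusion means L2-normalizing each per-location feature vector and concatenating the two, and that matching means picking the target location with the highest cosine similarity. The function names, the weight `alpha`, and the random arrays standing in for real SD and DINOv2 features are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse_features(sd_feat, dino_feat, alpha=0.5):
    """Fuse two per-location feature sets of shape (N, C_sd) and (N, C_dino).

    Assumption: each set is L2-normalized so neither dominates because of
    its raw scale, then the two are concatenated. `alpha` is a hypothetical
    weight balancing the two feature types.
    """
    sd = l2_normalize(sd_feat)
    dino = l2_normalize(dino_feat)
    return np.concatenate([alpha * sd, (1.0 - alpha) * dino], axis=-1)


def nearest_neighbor_matches(src, tgt):
    """For each source row, return the index of the most similar target row
    under cosine similarity."""
    sim = l2_normalize(src) @ l2_normalize(tgt).T  # (N_src, N_tgt)
    return sim.argmax(axis=1)


# Toy usage: random stand-ins for flattened feature maps of two images.
rng = np.random.default_rng(0)
sd_src, dino_src = rng.normal(size=(16, 32)), rng.normal(size=(16, 24))
sd_tgt, dino_tgt = rng.normal(size=(16, 32)), rng.normal(size=(16, 24))

fused_src = fuse_features(sd_src, dino_src)
fused_tgt = fuse_features(sd_tgt, dino_tgt)
matches = nearest_neighbor_matches(fused_src, fused_tgt)  # one target index per source location
```

Normalizing before concatenation is the design point of the sketch: without it, whichever feature type has the larger magnitude would dominate the similarity scores regardless of how informative it is.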
