DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Yiren Song, Xiaokang Liu, Mike Zheng Shou

TL;DR

DiffSim introduces a diffusion-model-based framework for evaluating visual similarity between reference and generated images, addressing the limitations of pixel-level and high-level semantic metrics. By extracting and aligning attention features from the denoising U-Net through the Aligned Attention Score, DiffSim captures both appearance and style coherence without fine-tuning. It presents two variants, DiffSim-S (self-attention) and DiffSim-C (cross-attention with IP-Adapter Plus), and introduces Sref and IP benchmarks to quantify style and instance similarity, respectively. The method achieves state-of-the-art alignment with human judgments across diverse benchmarks and demonstrates adaptability to CLIP and DINO, offering a robust, scalable tool for evaluating visual coherence in generative tasks.

Abstract

Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.

Paper Structure

This paper contains 39 sections, 6 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: We propose DiffSim, a method that utilizes pre-trained diffusion models to extract image features for evaluating visual similarity. Our method leads in human judgment consistency, style similarity, and instance-level consistency.
  • Figure 2: The illustration shows two DiffSim implementations: DiffSim-S using self-attention, where U-Net extracts features from both images to compute Aligned Attention Score (AAS) at a specified layer; and DiffSim-C using cross-attention, where features are extracted via IP-Adapter Plus and U-Net with swapped image inputs.
  • Figure 3: To evaluate style similarity and instance-level similarity, we introduced the Sref bench and IP bench. The Sref dataset contains 508 styles, each generated by Midjourney's sref mode and handpicked by human artists, represented through four different thematic reference images. The IP dataset includes a set of 299 IPs comprising highly similar images along with variants that gradually decrease in similarity.
  • Figure 4: Retrieval examples using DiffSim, CLIP, and DINO v2. The left, middle, and right columns display retrieval results from the Sref benchmark, the MS COCO Test dataset, and the IP benchmark, respectively.
  • Figure 5: Evaluation of different benchmarks across different timesteps, blocks and resolutions. (1) All experiments across timesteps are conducted on the $\text{U}_0$ block of SD 1.5. (2) The results of different blocks of Sref, IP, and NIGHTS experiments are completed with fixed $t$ of 900, 750, and 600, respectively. (3) The results of different resolutions are all based on the best settings of timesteps and blocks.
  • ...and 11 more figures
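
The Figure 2 caption describes computing an Aligned Attention Score (AAS) from attention-layer features of two images, but this excerpt does not give the formula. The sketch below is one plausible reading, offered as an assumption rather than as the authors' definition: queries from image A attend once to A's own keys and values and once to image B's, and the two outputs are compared by cosine similarity. The function names and the toy inputs are hypothetical, and extracting real query/key/value features from a U-Net layer is not shown.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(logits)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def aligned_attention_score(qa, ka, va, kb, vb):
    """Assumed AAS: mean per-token cosine similarity between
    A's self-attention output and A's queries attending to B."""
    self_out = attention(qa, ka, va)
    aligned_out = attention(qa, kb, vb)
    sims = [cosine(s, t) for s, t in zip(self_out, aligned_out)]
    return sum(sims) / len(sims)


# Toy features standing in for one attention layer (2 tokens, 2 channels).
qa = [[1.0, 0.0], [0.0, 1.0]]
ka = [[1.0, 0.0], [0.0, 1.0]]
va = [[1.0, 2.0], [3.0, 4.0]]

same = aligned_attention_score(qa, ka, va, ka, va)
# Reordering B's tokens (keys and values together) leaves the score unchanged.
shuffled = aligned_attention_score(qa, ka, va, ka[::-1], va[::-1])
```

Under this reading, the attention weights perform the alignment: each query from A gathers the best-matching content from B wherever it sits, so the score does not depend on the spatial order of B's tokens.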