Diffusion Classifiers Understand Compositionality, but Conditions Apply

Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach

TL;DR

This work systematically assesses diffusion classifiers for compositional discrimination across SD1.5, SD2.0, and SD3-m, using ten benchmarks and 33 tasks, and introduces Self-Bench to isolate domain effects. It finds that diffusion classifiers can outperform CLIP on certain spatial tasks but often lag on counting and object recognition, and that SD3-m is not consistently superior to earlier versions. A key insight is that the domain gap between generated and real images largely governs discriminative performance, and that lightweight, low-shot timestep weighting can substantially mitigate this gap, especially for SD3-m. Together, Self-Bench and timestep reweighting provide practical tools for diagnosing and narrowing the gap, showing that diffusion classifiers understand aspects of compositionality only under well-aligned conditions.

Abstract

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of the conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in performance; to isolate the domain effects, we introduce a new diagnostic benchmark, Self-Bench, composed of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
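To make the idea of a zero-shot diffusion classifier with timestep weighting concrete, here is a minimal sketch. It assumes the per-caption, per-timestep noise-prediction errors have already been computed by some denoiser. The function name `classify_from_errors`, the error values, and the simple normalized-weight scheme are illustrative assumptions, not the authors' implementation. With uniform weights, the decision reduces to picking the caption with the lowest average denoising error.

```python
from typing import List, Optional, Sequence, Tuple


def classify_from_errors(
    errors: Sequence[Sequence[float]],
    timestep_weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Pick the caption with the lowest weighted denoising error.

    errors[c][t] is a precomputed (hypothetical) noise-prediction error for
    candidate caption c at sampled timestep t.
    timestep_weights holds one non-negative weight per timestep; None means
    uniform weighting over timesteps.
    """
    if not errors:
        raise ValueError("need at least one candidate caption")
    num_timesteps = len(errors[0])
    if num_timesteps == 0 or any(len(row) != num_timesteps for row in errors):
        raise ValueError("every caption needs the same, non-zero number of timesteps")

    if timestep_weights is None:
        weights = [1.0 / num_timesteps] * num_timesteps
    else:
        if len(timestep_weights) != num_timesteps:
            raise ValueError("need exactly one weight per timestep")
        total = float(sum(timestep_weights))
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        # Normalize so that scores stay comparable across weightings.
        weights = [w / total for w in timestep_weights]

    # Weighted sum of errors per caption; lower means a better match.
    scores = [sum(w * e for w, e in zip(weights, row)) for row in errors]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up errors for two captions at two timesteps.
example_errors = [[0.9, 0.2], [0.3, 0.6]]
uniform_choice, uniform_scores = classify_from_errors(example_errors)
reweighted_choice, _ = classify_from_errors(example_errors, [0.0, 1.0])
```

In this toy input, the chosen caption changes when all weight is moved to the second timestep, which illustrates why the choice of timestep weights can matter for the final prediction.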

Paper Structure

This paper contains 37 sections, 20 equations, 23 figures, 27 tables.

Figures (23)

  • Figure 1: Overview of our findings. Finding I: Diffusion models can perform compositional discrimination reasonably on real images, but underperform CLIP, especially on counting tasks (§\ref{sec:hypothesis1}). Finding II: Diffusion models can understand (through classification) the images they can generate (§\ref{sec:hypothesis2}). Finding III: Timestep reweighting improves discrimination by reducing the domain gap between generated and real data (§\ref{sec:hypothesis3}).
  • Figure 2: Examples of standard benchmarks vs. Self-Bench. Each benchmark is categorized into four broad task groups: Object, Attribute, Position, and Counting. Each group consists of one or more tasks, and we present one example per task for illustration. We indicate positive/negative captions, where the task involves matching the positive caption with its corresponding image. Notably, standard benchmarks and Self-Bench differ in domain, including factors such as style, resolution, and object scale.
  • Figure 3: Diagnosing with Self-Bench. (i) Generate images using GenEval's prompts from six categories. (ii) For each generated image, create discriminative tasks within its type from the prompts used in the generation process. (iii) Benchmark the diffusion classifier on the generated images (filtered by humans) and the discriminative tasks.
  • Figure 4: Evaluating compositional generalization across different categories. The bars represent average classification accuracies across all tasks within each category. Notably, SD3-m does not outperform other Stable Diffusion models on most benchmarks, and CLIP usually outperforms diffusion models.
  • Figure 5: Self-Bench In-domain performance. (Top three plots) Each row represents the classification accuracy of a diffusion classifier from a specific SD model when evaluated on its own generated data. (Bottom) A positive correlation is observed between generative and discriminative performance. Left axis: discrimination; right axis: generation accuracy.
  • ...and 18 more figures
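
The Self-Bench procedure summarized in the Figure 3 caption (generate images from categorized prompts, then turn those same prompts into discriminative tasks) can be sketched as follows. This is a hedged illustration only. The `Task` record, the `build_tasks` helper, the dictionary keys, and the example prompts are all hypothetical. It assumes human filtering has already happened upstream and that negative captions are simply the other prompts of the same category, which may differ from the authors' exact construction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    """One discriminative task: match an image to its positive caption."""
    image_id: str
    category: str
    positive: str
    negatives: List[str]


def build_tasks(generated: List[Dict[str, str]]) -> List[Task]:
    """Turn generated images into caption-matching tasks.

    Each item needs the (hypothetical) keys "image_id", "category", and
    "prompt", where "prompt" is the text the image was generated from.
    """
    # Collect the distinct prompts seen in each category, in input order.
    prompts_by_category: Dict[str, List[str]] = {}
    for item in generated:
        bucket = prompts_by_category.setdefault(item["category"], [])
        if item["prompt"] not in bucket:
            bucket.append(item["prompt"])

    tasks: List[Task] = []
    for item in generated:
        # Assumption: negatives are the other prompts of the same category.
        negatives = [
            p for p in prompts_by_category[item["category"]] if p != item["prompt"]
        ]
        if negatives:  # Skip images that have no contrasting caption.
            tasks.append(
                Task(item["image_id"], item["category"], item["prompt"], negatives)
            )
    return tasks


# Made-up placeholder data, not taken from the benchmark.
example_images = [
    {"image_id": "img0", "category": "counting", "prompt": "a photo of two dogs"},
    {"image_id": "img1", "category": "counting", "prompt": "a photo of three dogs"},
    {"image_id": "img2", "category": "color", "prompt": "a photo of a red car"},
]
example_tasks = build_tasks(example_images)
```

Each resulting task can then be scored by a classifier that ranks the positive caption against the negatives for the given image.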