Diffusion Classifiers Understand Compositionality, but Conditions Apply

Yujin Jeong; Arnas Uselis; Seong Joon Oh; Anna Rohrbach

TL;DR

This work systematically assesses diffusion classifiers on compositional discrimination across SD1.5, SD2.0, and SD3-m, using ten benchmarks and 33 tasks, and introduces Self-Bench to isolate domain effects. Diffusion classifiers can outperform CLIP on certain spatial tasks but often lag on counting and object recognition, and SD3-m is not consistently better than the earlier versions. A key insight is that the domain gap between generated and real images largely governs discriminative performance, and that lightweight, low-shot timestep weighting can substantially mitigate this gap, especially for SD3-m. Together, Self-Bench and timestep reweighting provide practical tools for diagnosing and narrowing the gap, and they show that diffusion classifiers understand aspects of compositionality only under well-aligned conditions.

Abstract

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of the conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in the respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark, Self-Bench, composed of images created by the diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
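
The abstract does not spell out the scoring rule, so the following is only a minimal sketch of how a zero-shot diffusion classifier with timestep weighting is commonly formulated: each candidate prompt is scored by the model's noise-prediction error on the noised image, averaged over timesteps, and the prompt with the lowest score is chosen. The function name `classify_from_errors`, the error matrix, and the weights are hypothetical and invented for illustration; they are not taken from the paper or its repository, and the per-timestep errors would in practice come from a diffusion model rather than being hard-coded.

```python
import numpy as np

def classify_from_errors(errors, weights=None):
    """Pick the prompt with the lowest (weighted) average denoising error.

    errors:  array of shape (num_prompts, num_timesteps); entry [c, t] is the
             noise-prediction error for prompt c at timestep t (assumed to be
             precomputed by some diffusion model).
    weights: optional per-timestep weights; uniform if omitted.
    """
    errors = np.asarray(errors, dtype=float)
    if weights is None:
        weights = np.ones(errors.shape[1])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize so scores stay comparable
    scores = errors @ weights           # one weighted-average error per prompt
    return int(np.argmin(scores)), scores

# Made-up errors for two candidate prompts at three timesteps.
toy_errors = np.array([
    [0.90, 0.50, 0.20],  # prompt 0
    [0.30, 0.55, 0.60],  # prompt 1
])

# Uniform weighting and a weighting that emphasizes the last timestep
# can disagree on which prompt wins.
uniform_pred, uniform_scores = classify_from_errors(toy_errors)
weighted_pred, weighted_scores = classify_from_errors(toy_errors, weights=[0.0, 0.2, 0.8])
print(uniform_pred, weighted_pred)
```

In this toy setup the prediction flips depending on which timesteps are emphasized, which is the kind of timestep sensitivity the abstract refers to. How the weights are actually fit from a few labeled examples is left to the paper.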
