Text-to-Image Diffusion Models are Zero-Shot Classifiers
Kevin Clark, Priyank Jaini
TL;DR
This work shows that text-to-image diffusion models can act as zero-shot classifiers by using how well they denoise an image under each class-conditional prompt as a proxy for that class's likelihood. It introduces a practical, albeit compute-intensive, framework for evaluating the discriminative capabilities of diffusion models (Imagen and Stable Diffusion) and compares them with CLIP on zero-shot classification, shape/texture cue-conflict robustness, and attribute binding. The diffusion models reach zero-shot accuracy competitive with CLIP, achieve state-of-the-art results on shape/texture cue-conflict tests, and show attribute-binding abilities on some tasks where CLIP does not. These findings suggest that generative pre-training can be a strong alternative or complement to contrastive pre-training for downstream discriminative tasks, and the method offers a quantitative lens for studying what diffusion models know.
Abstract
The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision-language tasks.
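The key idea above, scoring each label by how well the model denoises the image when conditioned on that label, can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the procedure used with Stable Diffusion or Imagen: `toy_denoiser` is a hypothetical stand-in for a text-conditioned diffusion model, each class is represented by a prototype vector rather than a text prompt, the noise schedule is a simple variance-preserving mix, and all noise levels are weighted equally.

```python
import numpy as np

def toy_denoiser(noisy, signal_frac, class_prototype):
    # Hypothetical stand-in for a text-conditioned diffusion model.
    # It predicts the clean image as a blend of the noisy input and a
    # prototype for the conditioning class; a real model would be a
    # neural network conditioned on a text prompt for the label.
    return signal_frac * noisy + (1.0 - signal_frac) * class_prototype

def classify_by_denoising(image, class_prototypes, n_trials=32, seed=0):
    # Returns (predicted class index, mean denoising error per class).
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(class_prototypes))
    for _ in range(n_trials):
        # Sample a noise level and corrupt the image (assumed schedule).
        noise_level = rng.uniform(0.1, 0.9)
        signal_frac = 1.0 - noise_level
        noise = rng.standard_normal(image.shape)
        noisy = np.sqrt(signal_frac) * image + np.sqrt(noise_level) * noise
        # Score the same noisy input under every class so that the
        # errors are directly comparable across classes.
        for k, prototype in enumerate(class_prototypes):
            pred = toy_denoiser(noisy, signal_frac, prototype)
            errors[k] += np.mean((pred - image) ** 2)
    errors /= n_trials
    # Lower denoising error is treated as higher label likelihood.
    return int(np.argmin(errors)), errors

# Usage: three synthetic classes; the image is a slightly perturbed class 2.
prototypes = np.random.default_rng(1).standard_normal((3, 64))
image = prototypes[2] + 0.05 * np.random.default_rng(2).standard_normal(64)
label, errors = classify_by_denoising(image, prototypes)
print(label, errors)
```

The sketch also hints at why the approach is compute-intensive: every candidate label requires its own denoising pass for every noise sample, so cost grows with the number of classes times the number of trials.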
