Text-to-Image Diffusion Models are Zero-Shot Classifiers

Kevin Clark, Priyank Jaini

TL;DR

This work shows that text-to-image diffusion models can serve as zero-shot classifiers by using denoising quality under class-conditional prompts as a proxy for likelihood. It introduces a practical, though compute-intensive, framework for evaluating the discriminative capabilities of diffusion models (Imagen and Stable Diffusion) and compares them with CLIP on classification, shape/texture cue-conflict robustness, and attribute binding. The results show zero-shot accuracy competitive with CLIP, state-of-the-art robustness to cue conflicts, and attribute-binding abilities on some tasks that CLIP lacks. These findings suggest that generative pre-training can be a strong alternative or complement to contrastive pre-training for downstream discriminative tasks, and the method gives a quantitative lens for studying what diffusion models know.

Abstract

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision-language tasks.
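
The key idea can be written compactly. The block below is a sketch in notation assumed here rather than taken from the text above: $x$ is the image, $y_k$ the text prompt for class $k$, $\hat{x}_\theta$ the model's denoised estimate, $\alpha_t$ and $\sigma_t$ a noise schedule, and $w_t$ a weighting over time-steps. A lower score means the prompt helped the model denoise the image better, so it stands in for a higher likelihood of that label.

```latex
% Noising step (assumed schedule \alpha_t, \sigma_t)
x_t = \alpha_t x + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Per-class score: weighted denoising error under the prompt y_k
s_k(x) = \mathbb{E}_{t,\,\epsilon}\!\left[ w_t \,\bigl\lVert x - \hat{x}_\theta(x_t, y_k, t) \bigr\rVert_2^2 \right]

% Prediction: the class whose prompt yields the lowest score
\hat{y}(x) = \arg\min_{k} \; s_k(x)
```

In practice the expectation would be approximated by sampling a finite set of time-steps and noise draws, which is why the method is compute-intensive.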

Paper Structure

This paper contains 32 sections, 12 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Zero-Shot Classification using Diffusion Models. We first compute denoising scores for each label prompt across multiple time-steps to generate a scores matrix. We then classify an image by aggregating the scores for each class using a weighting function over the time-steps. The image is assigned the class with the minimum aggregate score. In the paper's section on efficiency, we discuss how efficiency can be improved by computing only a subset of the full scores matrix.
  • Figure 2: Diffusion model classification with pruning.
  • Figure 3: Example predictions from Imagen when denoising the same image with different text prompts. Each set of images shows the original, noised, and denoised images for the two classes. The top two rows use ImageNet images and the bottom row uses Cue-Conflict.
  • Figure 4: Examples of the synthetic-data attribute binding tasks. We explored more sophisticated prompts than in the figure (e.g., "A blender rendering of two objects, one of which is a yellow sphere."), but they didn't substantially change results.
  • Figure 5: Model reliability diagram comparing confidence measures of Imagen on CIFAR-100. Using the number of model calls made by the paper's efficient classification algorithm as the confidence measure produces better-calibrated confidences than using the actual scores for the different classes.
  • ...and 1 more figure
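
The procedure in the Figure 1 caption (build a scores matrix, weight it over time-steps, take the minimum) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_scores_matrix`, `classify_from_scores`, and `make_toy_denoiser` are hypothetical names, the noising rule is an assumed variance-preserving schedule with `t` in (0, 1), and the toy denoiser merely stands in for a real text-conditioned diffusion model such as Imagen or Stable Diffusion.

```python
import numpy as np


def build_scores_matrix(image, prompts, timesteps, denoise_fn, rng):
    """Return a (num_prompts, num_timesteps) matrix of denoising errors."""
    scores = np.zeros((len(prompts), len(timesteps)))
    for j, t in enumerate(timesteps):
        # Assumed variance-preserving noising; t in (0, 1) is the noise level.
        noise = rng.standard_normal(image.shape)
        noised = np.sqrt(1.0 - t) * image + np.sqrt(t) * noise
        for i, prompt in enumerate(prompts):
            # In this sketch the same noised image is reused for every prompt.
            denoised = denoise_fn(noised, prompt, t)
            scores[i, j] = np.mean((image - denoised) ** 2)
    return scores


def classify_from_scores(scores, weights):
    """Aggregate scores over time-steps and pick the lowest-scoring class."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    aggregate = scores @ weights  # one weighted sum per class
    return int(np.argmin(aggregate)), aggregate


def make_toy_denoiser(prototypes):
    """Hypothetical stand-in for a diffusion model: pulls the noised image
    toward a fixed prototype associated with the prompt."""
    def denoise_fn(noised, prompt, t):
        return (1.0 - t) * noised + t * prototypes[prompt]
    return denoise_fn
```

To use it, pass an image array, a list of label prompts, a handful of time-steps, and a denoising callable; uniform weights such as `np.ones(len(timesteps))` are only a placeholder for whatever weighting function is chosen. Every entry of the matrix costs one model call, which is the cost the caption's remark about computing only a subset of the matrix is meant to reduce.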