Intriguing properties of generative classifiers

Priyank Jaini, Kevin Clark, Robert Geirhos

TL;DR

This work compares zero-shot generative classifiers derived from text-to-image models to discriminative models and human data on 17 challenging OOD datasets and perceptual tasks. By estimating class-conditional likelihoods $p_{\theta}(\boldsymbol{x}|y=\text{y}_k)$ (via diffusion variational lower bounds for diffusion models, or exact likelihoods for autoregressive models), the authors classify by $\tilde{y}=\arg\max_k \log p_{\theta}(\boldsymbol{x}|y=\text{y}_k)$. They report four key findings: (i) generative classifiers exhibit a human-like shape bias (e.g., 99% for Imagen), (ii) near human-level OOD robustness, (iii) strong alignment with human error patterns, and (iv) the ability to capture certain perceptual illusions. The results suggest that generative pre-training can yield robust, human-aligned object recognition and may offer insights for integrating generative and discriminative approaches in vision systems, despite current speed limitations and cross-model confounds. Key equations include the decision rule $\tilde{y}=\arg\max_k p(y=\text{y}_k|\boldsymbol{x})$ and approximations of $\log p_{\theta}(\boldsymbol{x}|y=\text{y}_k)$ computed with the generative model $p_{\theta}$.
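The decision rule above can be illustrated with a minimal sketch. It assumes a uniform class prior, under which the class maximizing the posterior $p(y=\text{y}_k|\boldsymbol{x})$ is also the class maximizing $\log p_{\theta}(\boldsymbol{x}|y=\text{y}_k)$ by Bayes' rule; the summary does not state the prior, so this is an assumption. The function name and the toy log-likelihood values are hypothetical and are not taken from the paper.

```python
import numpy as np

def classify_from_log_likelihoods(log_liks, log_prior=None):
    """Pick a class from per-class estimates of log p(x | y_k).

    Returns the index of the predicted class and the posterior over classes.
    """
    log_liks = np.asarray(log_liks, dtype=float)
    if log_prior is None:
        # Assumption: uniform prior over classes.
        log_prior = np.full_like(log_liks, -np.log(log_liks.size))
    log_joint = log_liks + log_prior
    # Normalize with a numerically stable log-sum-exp.
    shift = log_joint.max()
    log_evidence = shift + np.log(np.exp(log_joint - shift).sum())
    posterior = np.exp(log_joint - log_evidence)
    return int(np.argmax(log_joint)), posterior

# Hypothetical log-likelihood estimates for three classes.
toy_log_liks = [-1052.3, -1047.9, -1060.1]
pred, post = classify_from_log_likelihoods(toy_log_liks)
print(pred, post.round(3))
```

With a uniform prior the prior term shifts every class by the same constant, so it never changes the argmax; a non-uniform `log_prior` can be passed in if one is available.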

Abstract

What is the best paradigm to recognize objects -- discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data. We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.

Paper Structure

This paper contains 26 sections, 6 equations, 18 figures, 3 tables, 1 algorithm.

Figures (18)

  • Figure 1: Zero-shot generative classifiers achieve a human-level shape bias: 99% for Imagen, 93% for Stable Diffusion, 92% for Parti and 92-99% for individual human observers (96% on average). Most discriminative models are texture biased instead.
  • Figure 2: Classification with a diffusion generative classifier. Given a test image, such as a dog with clock texture (1), a text-to-image generative classifier adds random noise (2) and then reconstructs the image conditioned on the prompt "A bad photo of a <class>" for each class (3). The reconstructed image closest to the test image in $L_2$ distance is taken as the classification decision (4); this estimates the diffusion variational lower bound [clark2023text]. For visualization purposes, icons corresponding to the prompt class are superimposed on the reconstructed images.
  • Figure 3: Out-of-distribution accuracy across 17 challenging datasets [geirhos2021partial]. Detailed results for all parametric datasets are plotted in a separate figure (fig:results_accuracy), and a table (tab:benchmark_table_accurate) lists the accuracies.
  • Figure 4: Error consistency across 17 challenging datasets [geirhos2021partial]. This metric measures whether errors made by models align with errors made by humans (higher is better).
  • Figure 5: Detailed out-of-distribution accuracy for Imagen, Stable Diffusion and Parti in comparison to human observers. While not always aligning perfectly with human accuracy, the overall robustness achieved by Imagen and Stable Diffusion is comparable to that of human observers even though these models are zero-shot, i.e. neither designed nor trained to do classification.
  • ...and 13 more figures
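
The four steps in the Figure 2 caption (noise the image, reconstruct it once per class prompt, compare in $L_2$ distance, pick the closest) can be sketched as follows. This is a toy illustration, not the authors' implementation: `toy_denoise` is a hypothetical stand-in for a text-conditioned diffusion model, and the single noise level, the averaging over `n_trials` noise draws, and the 8x8 "images" are all assumptions made to keep the example runnable.

```python
import numpy as np

PROMPT = "A bad photo of a {}"  # prompt template quoted in the Figure 2 caption

def diffusion_classify(image, class_names, denoise, noise_std=0.5, n_trials=8, seed=0):
    """Return the class whose prompt yields the reconstruction closest to `image`."""
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(class_names))
    for _ in range(n_trials):
        # Step 2: add random noise to the test image.
        noisy = image + noise_std * rng.standard_normal(image.shape)
        for k, name in enumerate(class_names):
            # Step 3: reconstruct conditioned on the prompt for class k.
            recon = denoise(noisy, PROMPT.format(name))
            # Accumulate squared L2 distance to the original image.
            errors[k] += np.sum((recon - image) ** 2)
    errors /= n_trials
    # Step 4: the class with the smallest reconstruction error wins.
    return class_names[int(np.argmin(errors))], errors

# Hypothetical stand-in for a diffusion model: pulls the noisy input
# halfway toward a fixed per-prompt prototype.
prototypes = {
    PROMPT.format("dog"): np.ones((8, 8)),
    PROMPT.format("cat"): -np.ones((8, 8)),
}

def toy_denoise(noisy, prompt):
    return 0.5 * noisy + 0.5 * prototypes[prompt]

test_image = np.ones((8, 8))  # toy "dog" image
label, errs = diffusion_classify(test_image, ["dog", "cat"], toy_denoise)
print(label, errs)
```

A real generative classifier would replace `toy_denoise` with calls to the text-to-image model, which is where the inference cost mentioned in the abstract comes from: one reconstruction per class per noise draw.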