Diffusion Models Beat GANs on Image Classification

Soumik Mukhopadhyay, Matthew Gwilliam, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Srinidhi Hegde, Tianyi Zhou, Abhinav Shrivastava

TL;DR

This paper shows that state-of-the-art diffusion models, traditionally used for image generation, can serve as unified self-supervised learners for both discriminative and generative tasks. By extracting frozen features from a pre-trained guided diffusion U-Net at different diffusion steps $t$ and blocks $b$ and applying various lightweight classification heads, the authors achieve strong ImageNet classification results (the best frozen-feature probe, an attention head, reaches $71.89\%$ accuracy) and competitive FGVC transfer, while also surpassing BigBiGAN on both generation (FID around $26.21$) and classification metrics. They provide practical guidelines for feature extraction, ablate the effects of $t$, $b$, pooling size, and head design, and use centered kernel alignment (CKA) to compare diffusion features with those of ResNets and ViTs. Overall, diffusion models emerge as powerful, flexible unified representation learners that enable effective classification without modifying pre-trained weights, albeit with notable computational costs and some dataset-specific considerations.
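The representation comparison mentioned above relies on CKA. As a rough illustration, the sketch below computes the linear variant of CKA between two feature matrices with one row per image. Whether the paper uses exactly this variant is an assumption here, and the function name and random inputs are hypothetical stand-ins rather than the authors' code.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_examples, n_features).

    The two matrices may have different feature widths but must describe
    the same examples in the same row order.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical stand-ins for features from two different networks.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(64, 16))
feats_b = rng.normal(size=(64, 32))
print(linear_cka(feats_a, feats_a))
print(linear_cka(feats_a, feats_b))
```

Linear CKA is unchanged by rotating or uniformly rescaling either feature space, which is what makes it usable for comparing layers of different architectures and widths.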

Abstract

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.
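A minimal NumPy sketch of the frozen-feature pipeline the abstract describes: take one intermediate feature map, average-pool it to a small spatial grid, flatten it, and pass it to a linear head. The feature map below is a random stand-in; in practice it would come from a chosen U-Net block at a chosen noise time step. All function names, shapes, and the 10-class head are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def adaptive_avg_pool2d(fmap, out_size):
    """Average-pool a (C, H, W) feature map down to (C, out_size, out_size)."""
    c, h, w = fmap.shape
    pooled = np.empty((c, out_size, out_size), dtype=float)
    for i in range(out_size):
        # Bin i covers rows floor(i*h/out) up to ceil((i+1)*h/out).
        h0, h1 = (i * h) // out_size, -((-(i + 1) * h) // out_size)
        for j in range(out_size):
            w0, w1 = (j * w) // out_size, -((-(j + 1) * w) // out_size)
            pooled[:, i, j] = fmap[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return pooled


def linear_head(features, weight, bias):
    """Map a flat feature vector to class logits."""
    return features @ weight.T + bias


rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 16, 16))      # stand-in for a frozen feature map
pooled = adaptive_avg_pool2d(fmap, 2)        # pooling size is a tunable choice
features = pooled.reshape(-1)                # flatten to a vector of length 8*2*2
weight = rng.standard_normal((10, features.size)) * 0.01  # untrained head weights
bias = np.zeros(10)
logits = linear_head(features, weight, bias)
predicted_class = int(np.argmax(logits))
```

Only `weight` and `bias` would be trained in this setup; the network that produces `fmap` stays frozen.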

Paper Structure (19 sections, 4 equations, 7 figures, 10 tables)

This paper contains 19 sections, 4 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: An overview of our method and results. We propose that diffusion models are unified self-supervised image representation learners, with impressive performance not only for generation, but also for classification. We explore the feature extraction process in terms of U-Net block number and diffusion noise time step. We also explore different sizes for the feature map pooling. We examine several lightweight architectures for feature classification, including linear (A), multi-layer perceptron (B), CNN (C), and attention-based heads (D). On the right, we show the results of these explorations for classification heads trained on frozen features for ImageNet-50 (Van Gansbeke et al., 2020), computed at block number 24 and noise time step 90. See the main results section for a detailed discussion.
  • Figure 2: Ablations on ImageNet (1000 classes) with varying block numbers, time steps, and pooling size, for a linear classification head on frozen features. We find the model is least sensitive to pooling, and most sensitive to block number, although there is also a steep drop-off in performance as inputs and predictions become noisier.
  • Figure 3: Images at different time steps of the diffusion process, with noise added successively. We observe that the best accuracies are obtained at $t = 90$.
  • Figure 4: Fine-Grained Visual Classification (FGVC) results. We train our best classification heads from our ImageNet-50 explorations on FGVC datasets (denoted with GD), and compare against the results from linear probing a SimCLR ResNet-50 on the same datasets. Linear is denoted by (L). While SimCLR and SwAV tend to perform better, diffusion achieves promising results, slightly outperforming SimCLR for Aircraft.
  • Figure 5: FGVC feature extraction analysis. We show accuracy for different block numbers, time steps, and pooling sizes. Block 19 is superior for FGVC, in contrast to ImageNet where 24 was ideal.
  • ...and 2 more figures
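
The time-step ablations in the figures above depend on how much noise an image carries at step $t$. The sketch below shows the standard closed-form forward noising used by DDPM-style models, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The linear beta schedule and its endpoints are assumptions chosen for illustration; the pre-trained model studied in the paper may use a different schedule, and the function names are hypothetical.

```python
import numpy as np


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, noise=None, rng=None):
    """Draw x_t from q(x_t | x_0) in closed form, without iterating over steps."""
    if noise is None:
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


alpha_bar = linear_alpha_bar()
image = np.zeros((3, 8, 8))                  # stand-in for a normalized image
noisy = q_sample(image, 90, alpha_bar, rng=np.random.default_rng(0))
```

Because $\bar{\alpha}_t$ shrinks as $t$ grows, larger time steps keep less of the original image, which is consistent with the accuracy drop-off the captions report for noisier inputs.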