Do text-free diffusion models learn discriminative visual representations?

Soumik Mukhopadhyay; Matthew Gwilliam; Yosuke Yamaguchi; Vatsal Agarwal; Namitha Padmanabhan; Archana Swaminathan; Tianyi Zhou; Jun Ohya; Abhinav Shrivastava

TL;DR

Diffusion models outperform GANs on discriminative tasks and, with the proposed fusion (DifFormer) and feedback (DifFeed) mechanisms, compete with state-of-the-art unsupervised image representation learning methods.

Abstract

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and build on it with DifFormer, a transformer that fuses features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks: image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (https://mgwillia.github.io/diffssl/) and code (https://github.com/soumik-kanad/diffssl) are publicly available.
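
To make the idea of attention-based pooling of a feature map concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the single query vector, the projection matrices `w_key` and `w_value`, the shapes, and the function names are all assumptions chosen for illustration. It only shows how a spatial feature map of shape (C, H, W) could be reduced to one vector by a softmax-weighted sum over spatial positions, in place of average pooling.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(feature_map, query, w_key, w_value):
    """Pool a (C, H, W) feature map into a single (D,) vector.

    Hypothetical parameters (assumptions, not from the paper):
      query:   (D,)   query vector that would be learned in practice
      w_key:   (C, D) projection from channels to keys
      w_value: (C, D) projection from channels to values
    Returns the pooled vector and the (H*W,) attention weights.
    """
    c, h, w = feature_map.shape
    tokens = feature_map.reshape(c, h * w).T           # (H*W, C): one token per position
    keys = tokens @ w_key                              # (H*W, D)
    values = tokens @ w_value                          # (H*W, D)
    scores = keys @ query / np.sqrt(query.shape[0])    # scaled dot-product scores
    weights = softmax(scores)                          # sums to 1 over positions
    pooled = weights @ values                          # weighted sum of values
    return pooled, weights


# Random stand-ins for a U-Net feature map and the pooling parameters.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))
query = rng.standard_normal(16)
w_key = rng.standard_normal((8, 16))
w_value = rng.standard_normal((8, 16))
pooled, weights = attention_pool(fmap, query, w_key, w_value)
```

With an all-zero query every position receives the same weight, so this sketch reduces to average pooling of the projected values. That makes it a convenient sanity check.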

Paper Structure (38 sections, 4 equations, 10 figures, 18 tables)

Introduction
Related Work
Generative Models.
Discriminative Models.
Unified Models.
Diffusion Features.
Analysis
Preliminaries
Diffusion Models Fundamentals.
Diffusion Models Feature Extraction.
Our Key Findings
Our Proposed Feature Fusion
Attention-based Classification Head
DifFormer: Transformer Feature Fusion
DifFeed: Feedback Feature Fusion
...and 23 more sections

Figures (10)

Figure 1: An overview of our method and results. We propose that out-of-the-box pre-trained unconditional diffusion models inherently have discriminative properties that automatically make them unified self-supervised image representation learners, with impressive performance not only for generation, but also for discrimination. We improve on the promising results of out-of-the-box diffusion classifiers with our (a) fusion-based DifFormer, and (b) feedback-based DifFeed methods for intelligently utilizing the unique features of diffusion models. (c) We report exciting performances of our methods on multiple downstream benchmarks.
Figure 2: Hypothesis: Diffusion features from low (region I) and high (region III) time steps are not the most discriminative and have lower performance. The best features can be found in early-middle time steps (region II) and vary based on tasks/datasets. At low time steps, the diffusion model focuses more on stochastic details than on structure, while at high time steps the input is less recognizable, so feature quality degrades.
Figure 3: Feature representation comparisons via centered kernel alignment (CKA). (a) Similarity of diffusion U-Net features across blocks at $t=90$ with features from MAE (ViT-B) layers. (b) Similarity across blocks of the diffusion U-Net at $t=90$. (c) Similarity across timesteps of features from U-Net block $b=24$. (a), (b), and (c) point toward the diffusion U-Net features being quite diverse.
Figure 4: Ablations on ImageNet (1000 classes) with varying time steps, block numbers, and pooling size, for a linear classification head on frozen features. We find the model is least sensitive to pooling, and most sensitive to block number, although there is also a steep drop-off in performance as inputs and predictions become noisier. We further provide ResNet-50's (R50) performance over noisy time step images for comparison.
Figure 5: FGVC feature extraction analysis. We show accuracy for different block numbers, time steps, and pooling sizes. Block 19 is superior for FGVC, in contrast to ImageNet where 24 was ideal.
...and 5 more figures
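
The Figure 3 caption compares features via centered kernel alignment (CKA). As a rough guide to what such a similarity score measures, here is a minimal sketch of the linear variant of CKA in NumPy. The captions do not say which CKA variant or batching scheme the authors used, so the linear kernel, the function name `linear_cka`, and the random stand-in features are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows.

    x: (n_samples, d1), y: (n_samples, d2). Returns a value in [0, 1].
    """
    # Center each feature dimension across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Random stand-ins for features from two layers (or two time steps).
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((64, 10))
rotation, _ = np.linalg.qr(rng.standard_normal((10, 10)))
feats_b = 3.0 * feats_a @ rotation         # rotated and rescaled copy of feats_a
feats_c = rng.standard_normal((64, 12))    # unrelated features

same = linear_cka(feats_a, feats_a)
rotated = linear_cka(feats_a, feats_b)
unrelated = linear_cka(feats_a, feats_c)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `rotated` matches `same` up to floating-point error, while unrelated features score lower. Low off-diagonal values in a CKA grid are what a caption like "quite diverse" would correspond to.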
