Diffusion Models and Representation Learning: A Survey

Michael Fuest, Pingchuan Ma, Ming Gui, Johannes Schusterbauer, Vincent Tao Hu, Bjorn Ommer

TL;DR

This survey maps the evolving relationship between diffusion models and representation learning, clarifying how the denoising objective fosters semantic representations and how representations can in turn guide diffusion in a self-supervised manner. It introduces a taxonomy and generalized frameworks for extracting diffusion-based features, transferring them to downstream tasks, and jointly training or augmenting diffusion models with discriminative objectives. The work highlights methods based on intermediate activations, knowledge distillation, latent reconstruction, joint modeling, and generative augmentation, and details assignment-based and representation-based guidance strategies. It also discusses key challenges, such as computational demands and interpretability, and outlines future directions, including architectural innovations and flow-matching paradigms, for advancing diffusion-based representation learning.

Abstract

Diffusion models are popular generative modeling methods for a range of vision tasks and have attracted significant attention. Because they require no label annotation, they can be regarded as a distinctive instance of self-supervised learning. This survey explores the interplay between diffusion models and representation learning. It first reviews the essentials of diffusion models: their mathematical foundations, popular denoising network architectures, and guidance methods. It then details two families of approaches: frameworks that reuse representations learned by pre-trained diffusion models for downstream recognition tasks, and methods that use advances in representation and self-supervised learning to improve diffusion models. The survey aims to provide a comprehensive taxonomy of the relationship between diffusion models and representation learning, identifying key open concerns and directions for further exploration. GitHub link: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy
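As a rough illustration of the first family (reusing a denoiser's internal representations for recognition), the sketch below noises an input with the standard DDPM forward process and reads out a hidden activation as a feature vector. It is a toy under stated assumptions, not the survey's method: `ToyDenoiser`, `extract_features`, the layer sizes, and the chosen noise level are all hypothetical, and the random weights stand in for a genuinely pre-trained network.

```python
import numpy as np


def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps


class ToyDenoiser:
    """Hypothetical two-layer MLP standing in for a pre-trained denoiser."""

    def __init__(self, dim, hidden, rng):
        # Random weights: a real pipeline would load trained parameters here.
        self.w1 = rng.standard_normal((dim + 1, hidden)) / np.sqrt(dim + 1)
        self.w2 = rng.standard_normal((hidden, dim)) / np.sqrt(hidden)

    def __call__(self, x_t, t_frac, return_features=False):
        # Append the normalized timestep as an extra input column.
        t_col = np.full((x_t.shape[0], 1), t_frac)
        h = np.tanh(np.concatenate([x_t, t_col], axis=1) @ self.w1)
        eps_hat = h @ self.w2  # predicted noise
        return (eps_hat, h) if return_features else eps_hat


def extract_features(denoiser, x0, alpha_bar_t, t_frac, rng):
    """Noise the input, run the denoiser, and keep the hidden activation."""
    x_t, _ = forward_noise(x0, alpha_bar_t, rng)
    _, feats = denoiser(x_t, t_frac, return_features=True)
    return feats


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # batch of 4 toy "images" with 8 dims each
denoiser = ToyDenoiser(dim=8, hidden=16, rng=rng)
feats = extract_features(denoiser, x0, alpha_bar_t=0.9, t_frac=0.1, rng=rng)
print(feats.shape)  # (4, 16)
```

In practice the extracted features would feed a lightweight head (for example, a linear classifier) trained on the downstream task. Which layer and which noise level yield the most useful features is a design choice that has to be settled empirically.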

Paper Structure (21 sections, 19 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Yearly numbers of published and preprint papers on diffusion models and representation learning. For 2024, the green bar indicates the number of papers collected up to and including June 2024, and the dashed grey bar indicates the projected number for the whole year.
  • Figure 2: Left: Qualitative generation results from diffusion models conditioned on self-supervised guidance signals. Right: Qualitative results on downstream image tasks that leverage representations learned while training diffusion models. Adapted from li_return_2024, hu_guided_2023, pan_masked_2024, baranchuk_label-efficient_2022, yang_diffusion_2023.
  • Figure 3: Left: An exemplary visualization of the U-Net architecture ronneberger_u-net_2015. The U-Net consists of an encoder and a decoder, with residual connections that preserve gradient flow and low-level input details. Adapted from prince_understanding_2023. Right: An exemplary visualization of the DiT architecture, showing the high-level architecture as well as a breakdown of the adaLN-Zero DiT block. Adapted from peebles_scalable_2023.
  • Figure 4: A high-level overview of a framework for extracting representations from pre-trained diffusion models for downstream tasks.
  • Figure 5: A hierarchical overview of current diffusion model training frameworks that leverage representation learning techniques for conditional generation and guidance.
  • ...and 1 more figure