Diffusion-Based Representation Learning

Sarthak Mittal, Korbinian Abstreiter, Stefan Bauer, Bernhard Schölkopf, Arash Mehrjou

TL;DR

This work augments the denoising score matching framework to enable representation learning without any supervised signal, and proposes an infinite-dimensional latent code that improves on state-of-the-art models in semi-supervised image classification.

Abstract

Diffusion-based methods represented as stochastic differential equations on a continuous-time domain have recently proven successful as non-adversarial generative models. Training such models relies on denoising score matching, which can be seen as training multi-scale denoising autoencoders. Here, we augment the denoising score matching framework to enable representation learning without any supervised signal. GANs and VAEs learn representations by directly transforming latent codes to data samples. In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective and thus encodes the information needed for denoising. We illustrate how this difference allows for manual control of the level of detail encoded in the representation. Using the same approach, we propose to learn an infinite-dimensional latent code that improves on state-of-the-art models in semi-supervised image classification. We also compare the quality of representations learned with diffusion score matching against other methods, such as autoencoders and contrastively trained systems, through their performance on downstream tasks.
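
As a reading aid, a conditional denoising score matching objective of the kind described in the abstract can be written roughly as follows. The notation is an assumption borrowed from the usual score-based SDE convention, not quoted from the paper: $s_\theta$ is the score model, $E_\phi$ the encoder, $\lambda(t)$ a positive weighting function, and $p_{0t}(x_t \mid x_0)$ the perturbation kernel of the forward SDE.

$$
\mathcal{L}(\theta, \phi) = \mathbb{E}_{t}\, \mathbb{E}_{x_0}\, \mathbb{E}_{x_t \mid x_0} \left[ \lambda(t)\, \big\| s_\theta\big(x_t, t, E_\phi(x_0)\big) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \big\|_2^2 \right]
$$

Under this reading, the only way for the encoder output to lower the loss is to carry information about $x_0$ that helps denoise $x_t$, which is the sense in which the code "encodes the information needed for denoising". Replacing $E_\phi(x_0)$ with a time-dependent $E_\phi(x_0, t)$ gives one code per noise level, and the collection over $t \in [0, T]$ is the infinite-dimensional code.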

Paper Structure

This paper contains 27 sections, 1 theorem, 16 equations, 13 figures, 6 tables.

Key Result

Proposition 2.1

For any downstream task, the infinite-dimensional code $(E_\phi(x_0, t))_{t\in[0, T]}$ learned using the objective in Equation (eq:time_repr_obj) is at least as good as finite-dimensional static codes learned by the reconstruction of $x_0$.
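
To make the time-conditioned code $E_\phi(x_0, t)$ concrete, the following is a minimal NumPy sketch of a single loss evaluation. Everything in it is an illustrative assumption rather than the authors' implementation: the variance-exploding noise schedule `sigma`, the weighting $\lambda(t) = \sigma(t)^2$, the single-layer `encode` and `score` functions, and all parameter names are hypothetical.

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=50.0):
    """Noise scale of a variance-exploding SDE (assumed schedule)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def init_params(rng, data_dim, latent_dim):
    """Random weights for the two toy networks (hypothetical layout)."""
    return {
        "W_enc": 0.1 * rng.standard_normal((data_dim + 1, latent_dim)),
        "W_score": 0.1 * rng.standard_normal(
            (data_dim + 1 + latent_dim, data_dim)),
    }

def encode(params, x0, t):
    """Toy time-conditioned encoder E_phi(x0, t): one tanh layer."""
    inp = np.concatenate([x0, t[:, None]], axis=1)
    return np.tanh(inp @ params["W_enc"])

def score(params, xt, t, z):
    """Toy score model s_theta(x_t, t, z): one linear layer."""
    inp = np.concatenate([xt, t[:, None], z], axis=1)
    return inp @ params["W_score"]

def drl_loss(params, x0, rng, T=1.0):
    """Monte Carlo estimate of the conditional score matching loss."""
    batch = x0.shape[0]
    t = rng.uniform(1e-5, T, size=batch)      # one diffusion time per sample
    eps = rng.standard_normal(x0.shape)       # Gaussian noise
    s = sigma(t)[:, None]
    xt = x0 + s * eps                         # sample from p_0t(x_t | x_0)
    z = encode(params, x0, t)                 # code sees clean data and time
    pred = score(params, xt, t, z)
    target = -eps / s                         # score of the Gaussian kernel
    # Weighting lambda(t) = sigma(t)^2 (assumed choice).
    return np.mean(np.sum(s**2 * (pred - target) ** 2, axis=1))

rng = np.random.default_rng(0)
params = init_params(rng, data_dim=4, latent_dim=2)
x0 = rng.standard_normal((8, 4))
print(drl_loss(params, x0, rng))
```

Dropping the `t` argument from `encode` recovers a static code $E_\phi(x_0)$. The time-conditioned version lets the encoder emit a different code at each noise level, which is the family the proposition compares against static codes.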

Figures (13)

  • Figure 1: Conditional score matching with a parametrized latent code is representation learning. Denoising score matching estimates the score at each $x_t$; we add a latent representation $z$ of the clean data $x_0$ as additional input to the score estimator.
  • Figure 2: Results of proposed DRL models trained on MNIST and CIFAR-10 with point clouds visualizing the latent representation of test samples, colored according to the digit class. The models are trained with Left: uniform sampling of $t$ and Right: a focus on high noise levels. Samples are generated from a grid of latent values ranging from -1 to 1.
  • Figure 3: Results of proposed VDRL models trained on MNIST and CIFAR-10 with point clouds visualizing the latent representation of test samples, colored according to the digit class. The models are trained with Left: uniform sampling of $t$ and Right: a focus on high noise levels. Samples are generated from a grid of latent values ranging from -2 to 2.
  • Figure 4: Comparing the performance of the proposed diffusion-based representations (DRL and VDRL) with the baselines, which include an autoencoder (AE), a variational autoencoder (VAE), simple contrastive learning (simCLR), and its restricted variant (simCLR-Gauss), which excludes domain-specific data augmentation from the original simCLR algorithm.
  • Figure 5: Comparing the performance of the proposed diffusion-based representations (DRL and VDRL) with the baselines, which include an autoencoder (AE), a variational autoencoder (VAE), simple contrastive learning (simCLR), and its restricted variant (simCLR-Gauss), which excludes domain-specific data augmentation from the original simCLR algorithm.
  • ...and 8 more figures
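
The captions of Figures 2 and 3 contrast uniform sampling of $t$ with a focus on high noise levels. The sketch below shows one hypothetical way such a contrast could be set up. The power transform and the names `sample_t_uniform` and `sample_t_high_noise` are assumptions for illustration, not the sampling scheme used in the paper.

```python
import numpy as np

def sample_t_uniform(rng, n, T=1.0):
    """Draw diffusion times uniformly on [0, T]."""
    return rng.uniform(0.0, T, size=n)

def sample_t_high_noise(rng, n, T=1.0, power=0.25):
    """Draw diffusion times biased toward T (high noise).

    Raising a uniform variable on [0, 1] to a power below 1 shifts
    probability mass toward 1; this transform is an assumed choice.
    """
    u = rng.uniform(0.0, 1.0, size=n)
    return T * u ** power

rng = np.random.default_rng(0)
t_uniform = sample_t_uniform(rng, 10_000)
t_high = sample_t_high_noise(rng, 10_000)
print(t_uniform.mean(), t_high.mean())
```

With a sampler biased toward large $t$, the encoder is trained mostly on heavily corrupted inputs, where fine detail is already destroyed. This is one plausible mechanism behind the manual control over the level of detail in the representation that the abstract mentions.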

Theorems & Definitions (3)

  • Proposition 2.1
  • proof
  • proof