Diffusion Bridge AutoEncoders for Unsupervised Representation Learning

Yeongmin Kim, Kwanghyeon Lee, Minsang Park, Byeonghu Na, Il-Chul Moon

TL;DR

This work tackles unsupervised representation learning with diffusion models by addressing the information-split problem that arises when an auxiliary encoder and a fixed diffusion endpoint both carry information about the data. It introduces Diffusion Bridge AutoEncoders (DBAE), which impose a ${\mathbf{z}}$-dependent endpoint ${\mathbf{x}}_T$ through a forward SDE augmented by Doob's $h$-transform, making ${\mathbf{z}}$ an information bottleneck. The authors derive an entropy-regularized score-matching objective that jointly optimizes reconstruction and a learnable generative prior, with theoretical guarantees linking the objective to mutual information and KL bounds. Empirically, DBAE improves downstream inference, reconstruction fidelity, disentanglement, and unconditional generation compared to prior diffusion-based methods, while enabling efficient interpolation and attribute manipulation. This approach advances learnable diffusion representations and provides a solid foundation for downstream tasks requiring informative, compact latent variables.
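To make the role of Doob's $h$-transform concrete, the block below works through the simplest special case. It assumes a pure Brownian reference process $\mathrm{d}\mathbf{x}_t = \mathrm{d}\mathbf{w}_t$ on $[0, T]$, which is chosen here only for illustration and is not necessarily the SDE used in the paper. For a general reference SDE $\mathrm{d}\mathbf{x}_t = \mathbf{f}\,\mathrm{d}t + g\,\mathrm{d}\mathbf{w}_t$, the transform adds $g^2 \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_T \mid \mathbf{x}_t)$ to the drift so that the path is pinned at a chosen endpoint $\mathbf{x}_T$.

```latex
\begin{aligned}
\text{reference SDE:}\quad
  & \mathrm{d}\mathbf{x}_t = \mathrm{d}\mathbf{w}_t,
    \qquad
    p(\mathbf{x}_T \mid \mathbf{x}_t)
      = \mathcal{N}\big(\mathbf{x}_T;\ \mathbf{x}_t,\ (T-t)\,\mathbf{I}\big), \\
h\text{-function:}\quad
  & \mathbf{h}(\mathbf{x}_t, t, \mathbf{x}_T)
      := \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_T \mid \mathbf{x}_t)
      = \frac{\mathbf{x}_T - \mathbf{x}_t}{T - t}, \\
\text{pinned SDE:}\quad
  & \mathrm{d}\mathbf{x}_t
      = \frac{\mathbf{x}_T - \mathbf{x}_t}{T - t}\,\mathrm{d}t
        + \mathrm{d}\mathbf{w}_t, \\
\text{marginal:}\quad
  & \mathbf{x}_t \mid \mathbf{x}_0, \mathbf{x}_T
      \sim \mathcal{N}\Big(
        \big(1 - \tfrac{t}{T}\big)\,\mathbf{x}_0 + \tfrac{t}{T}\,\mathbf{x}_T,\
        \tfrac{t\,(T-t)}{T}\,\mathbf{I}
      \Big).
\end{aligned}
```

In this special case the pinned process is the Brownian bridge: its variance vanishes at both $t=0$ and $t=T$, so every path starts at $\mathbf{x}_0$ and ends exactly at the prescribed $\mathbf{x}_T$. In DBAE that endpoint is the one decoded from ${\mathbf{z}}$.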

Abstract

Diffusion-based representation learning has attracted substantial attention due to its promising capabilities in latent representation and sample generation. Recent studies have employed an auxiliary encoder to identify a corresponding representation from a sample and to adjust the dimensionality of a latent variable ${\mathbf{z}}$. However, this auxiliary structure introduces an information split problem, because the diffusion model and the auxiliary encoder divide the information from the sample into two separate representations, one per model. In particular, the information modeled by the diffusion becomes over-regularized because of the static prior distribution on ${\mathbf{x}}_T$. To address this problem, we introduce Diffusion Bridge AutoEncoders (DBAE), which enable ${\mathbf{z}}$-dependent endpoint ${\mathbf{x}}_T$ inference through a feed-forward architecture. This structure creates an information bottleneck at ${\mathbf{z}}$, so ${\mathbf{x}}_T$ becomes dependent on ${\mathbf{z}}$ in its generation. This has two consequences: 1) ${\mathbf{z}}$ holds the full information of samples, and 2) ${\mathbf{x}}_T$ becomes a learnable distribution rather than a static one. We propose an objective function for DBAE that enables both reconstruction and generative modeling, with theoretical justification. Empirical evidence supports the effectiveness of the intended design in DBAE, which notably enhances downstream inference quality, reconstruction, and disentanglement. Additionally, DBAE generates high-fidelity samples in unconditional generation. Our code is available at https://github.com/aailab-kaist/DBAE.

Paper Structure

This paper contains 43 sections, 6 theorems, 62 equations, 18 figures, 15 tables, 4 algorithms.

Key Result

Theorem 1

For the objective function $\mathcal{L}_{\textrm{AE}}$, the following equality holds. Moreover, if the forward SDE (eq:for) is linear, that is, if the drift function $\mathbf{f}$ is linear with respect to ${\mathbf{x}}_t$, there exist $\alpha(t)$, $\beta(t)$, $\gamma(t)$, $\lambda(t)$, such that where $\mathbf{x}^{0}_{\boldsymbol{\theta}}({\mathbf{x}}_t,t,{\mathbf{x}}_T):=\alpha(t){\mathbf{

Figures (18)

  • Figure 1: Comparison between DiffAE (Preechakul et al., 2022) and DBAE. (a) depicts the simplified Bayesian network of DiffAE, illustrating two inference paths for the distinct latent variables ${\mathbf{x}}_T$ and ${\mathbf{z}}$. (b) shows the reconstruction using the inferred ${\mathbf{z}}$ in DiffAE on CelebA, where the reconstruction results perceptually vary depending on the selection of ${\mathbf{x}}_T$. (c) shows the simplified Bayesian network of DBAE with ${\mathbf{z}}$-dependent ${\mathbf{x}}_T$ inference. (d) shows the inferred ${\mathbf{x}}_T$ from DiffAE and DBAE.
  • Figure 2: A schematic for Diffusion Bridge AutoEncoders. The blue line shows the latent variable inference. DBAE infers the ${\mathbf{z}}$-dependent endpoint ${\mathbf{x}}_T$ to make ${\mathbf{x}}_T$ tractable and to establish ${\mathbf{z}}$ as an information bottleneck. The paired ${\mathbf{x}}_0$ and ${\mathbf{x}}_T$ define a new forward SDE utilizing the $h$-transform. The decoder and the red line show the generative process. The generation starts from the bottleneck latent variable ${\mathbf{z}}$ and decodes it to the endpoint ${\mathbf{x}}_T$. The reverse process generates ${\mathbf{x}}_0$ from ${\mathbf{x}}_T$.
  • Figure 3: Reconstruction w/ inferred ${\mathbf{z}}$.
  • Figure 4: TAD-FID tradeoffs compared to the baselines.
  • Figure 5: Top two rows: uncurated samples. Bottom two rows: the sampling trajectory with ODE and SDE.
  • ...and 13 more figures
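
The inference path described in the Figure 2 caption (encode ${\mathbf{x}}_0$ to ${\mathbf{z}}$, decode ${\mathbf{z}}$ to the endpoint ${\mathbf{x}}_T$, then bridge between the paired ${\mathbf{x}}_0$ and ${\mathbf{x}}_T$) can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not the authors' implementation: the encoder and decoder are hypothetical random linear maps standing in for trained networks, and the reference process is assumed to be plain Brownian motion, so the $h$-transformed forward process is a Brownian bridge. All names (`encode`, `decode_endpoint`, `bridge_sample`, `bridge_drift`) and the dimensions are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: data dimension, latent dimension, terminal time.
D, Z, T = 8, 2, 1.0

# Hypothetical stand-ins for the learned networks (random linear maps).
W_enc = rng.normal(size=(Z, D)) / np.sqrt(D)  # encoder weights: x_0 -> z
W_dec = rng.normal(size=(D, Z)) / np.sqrt(Z)  # decoder weights: z -> x_T


def encode(x0):
    """Map a batch of samples x_0 with shape (N, D) to latents z with shape (N, Z)."""
    return x0 @ W_enc.T


def decode_endpoint(z):
    """Map latents z with shape (N, Z) to z-dependent endpoints x_T with shape (N, D)."""
    return z @ W_dec.T


def bridge_sample(x0, xT, t, rng):
    """Draw x_t ~ q(x_t | x_0, x_T) for a Brownian bridge on [0, T] (assumed reference: dx = dw)."""
    w = t / T
    mean = (1.0 - w) * x0 + w * xT
    std = np.sqrt(t * (T - t) / T)
    return mean + std * rng.normal(size=x0.shape)


def bridge_drift(xt, xT, t):
    """Drift added by the h-transform for dx = dw pinned at x_T; valid for t < T."""
    return (xT - xt) / (T - t)


# Inference path: x_0 -> z -> x_T, then a noisy intermediate state on the bridge.
x0 = rng.normal(size=(4, D))
z = encode(x0)
xT = decode_endpoint(z)
xt = bridge_sample(x0, xT, t=0.5, rng=rng)
drift = bridge_drift(xt, xT, t=0.5)
```

Because ${\mathbf{x}}_T$ is computed only from ${\mathbf{z}}$, everything the bridge sees about the sample beyond ${\mathbf{x}}_0$ itself passes through the bottleneck. In a trained model the two linear maps would be replaced by neural networks and the bridge would follow whichever forward SDE the model is defined with.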

Theorems & Definitions

  • Theorem 1
  • Theorem 2
  • Theorem 3