Exploring Diffusion Time-steps for Unsupervised Representation Learning

Zhongqi Yue, Jiankun Wang, Qianru Sun, Lei Ji, Eric I-Chao Chang, Hanwang Zhang

TL;DR

This work addresses unsupervised disentangled representation learning by linking diffusion time-steps to attributes. It introduces DiTi, which freezes a pre-trained Denoising Diffusion Probabilistic Model and augments it with a trainable encoder that maps images to a modular feature vector partitioned across time-steps, enabling each time-step to recover cumulatively lost attributes. Theoretical analysis ties attribute loss to diffusion-time-step–dependent overlap between noisy distributions and motivates using time-step–specific features, which yields improved attribute classification and faithful counterfactual generation on CelebA, FFHQ, and Bedroom, outperforming Diff-AE and PDAE baselines. The approach provides a scalable, principled pathway to disentanglement in diffusion-based generative frameworks and suggests future directions such as text-conditioned disentanglement and faster optimization.
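The idea of a feature vector partitioned across time-steps can be illustrated with a small NumPy sketch. It builds a binary mask that keeps only the feature dimensions assigned to time-steps 1 through t, so that later time-steps see a cumulatively larger part of the feature. The function name `cumulative_mask`, the equal split of dimensions across time-steps, and the values 512 and 1000 are assumptions made for illustration; the ablation caption in the source refers to the authors' own partition strategy, which may differ from an equal split.

```python
import numpy as np


def cumulative_mask(t, num_steps, feat_dim):
    """Binary mask keeping the feature dimensions assigned to time-steps 1..t.

    Assumption: feat_dim is split equally across num_steps. This is a
    hypothetical partition chosen for simplicity, not the authors' scheme.
    """
    if not 1 <= t <= num_steps:
        raise ValueError("t must be in [1, num_steps]")
    # Index of the partition boundary after time-step t under an equal split.
    keep = (t * feat_dim) // num_steps
    mask = np.zeros(feat_dim, dtype=np.float32)
    mask[:keep] = 1.0
    return mask


# Hypothetical 512-dimensional feature standing in for an encoder output.
z = np.random.default_rng(0).normal(size=512).astype(np.float32)

# Early time-step: only a small prefix of the feature is visible.
z_early = z * cumulative_mask(t=100, num_steps=1000, feat_dim=512)
# Final time-step: the full feature is visible.
z_late = z * cumulative_mask(t=1000, num_steps=1000, feat_dim=512)
```

Under this sketch, the masked feature at time-step t would be the conditioning input used to compensate for the attributes lost up to t, while dimensions belonging to later time-steps stay zeroed out.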

Abstract

Representation learning is all about discovering the hidden modular attributes that generate the data faithfully. We explore the potential of the Denoising Diffusion Probabilistic Model (DM) in unsupervised learning of the modular attributes. We build a theoretical framework that connects the diffusion time-steps and the hidden attributes, which serves as an effective inductive bias for unsupervised learning. Specifically, the forward diffusion process incrementally adds Gaussian noise to samples at each time-step, which essentially collapses different samples into similar ones by losing attributes, e.g., fine-grained attributes such as texture are lost with less noise added (i.e., early time-steps), while coarse-grained ones such as shape are lost by adding more noise (i.e., late time-steps). To disentangle the modular attributes, at each time-step t, we learn a t-specific feature to compensate for the newly lost attribute, and the set of all 1,...,t-specific features, corresponding to the cumulative set of lost attributes, is trained to make up for the reconstruction error of a pre-trained DM at time-step t. On the CelebA, FFHQ, and Bedroom datasets, the learned feature significantly improves attribute classification and enables faithful counterfactual generation, e.g., interpolating only one specified attribute between two images, validating the disentanglement quality. Code is available at https://github.com/yue-zhongqi/diti.
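The claim that small differences between samples are lost at early time-steps and large differences only at late ones can be checked numerically with the standard closed form of the forward process, $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$. The sketch below is illustrative only: the linear noise schedule (1e-4 to 0.02 over 1000 steps), the two-dimensional toy points, and the helper names `alpha_bar`, `q_sample`, and `separation` are assumptions, not details taken from the paper.

```python
import numpy as np


def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I); t is 1-indexed."""
    a = abar[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def separation(x0_a, x0_b, t, abar):
    """Distance between the two noisy means, measured in noise std units.

    A small value means the two noisy distributions overlap heavily, so the
    attribute that distinguishes x0_a from x0_b is effectively lost.
    """
    a = abar[t - 1]
    return np.sqrt(a) * np.linalg.norm(x0_a - x0_b) / np.sqrt(1.0 - a)


abar = alpha_bar()
rng = np.random.default_rng(0)

# Toy 2-D samples: x_a and x_b differ by a fine detail, x_a and x_c coarsely.
x_a = np.array([1.0, 0.0])
x_b = np.array([1.0, 0.2])
x_c = np.array([-1.0, 0.0])

for t in (1, 100, 500, 1000):
    fine = separation(x_a, x_b, t, abar)
    coarse = separation(x_a, x_c, t, abar)
    print(f"t={t:4d}  fine={fine:.3f}  coarse={coarse:.3f}")

x_t = q_sample(x_a, 500, abar, rng)  # one noisy draw at an intermediate step
```

Both separations shrink monotonically with t, and the fine-grained pair always has the smaller one, so it drops below any fixed distinguishability threshold at an earlier time-step than the coarse pair.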

Paper Structure (20 sections, 14 equations, 18 figures, 3 tables, 3 algorithms)

Figures (18)

  • Figure 1: (a) Illustration of attribute loss as time-step $t$ increases in the forward diffusion process. The two axes depict a two-dimensional sample space. (b) DM reconstructed $\mathbf{x}_0$, denoted as $\hat{\mathbf{x}}_0$, from randomly sampled $\mathbf{x}_t$ at various $t$. DM is pre-trained on CelebA, from where $\mathbf{x}_0$ is drawn.
  • Figure 2: (a) Counterfactual generations on CelebA by manipulating 16 out of 512 feature dimensions (i.e., simulating the edit of a single $\mathbf{z}_i$). A disentangled representation enables editing a single attribute (e.g., gender) without affecting others (e.g., lighting) and promotes faithful extrapolation (e.g., no artifacts). (b) Histogram of the classifier weight values. More dimensions of the DiTi weights are close to $1$ and $0$ (explanations in the text).
  • Figure 2: Ablations on DiTi designs on CelebA. Imbalance and Detach: using our partition and optimization strategy.
  • Figure 3: Illustration of our DiTi. We break down Eq. (6) at each time-step. On the right, we show the detailed network design, where $\hat{\mathbf{x}}_0$ denotes the reconstructed $\mathbf{x}_0$ by the pre-trained DM.
  • Figure 4: Improvements in attribute classification precision of our DiTi over PDAE (bottom) and over SimCLR (top). Improvements of more than 2% are highlighted with red bars. Negative values are marked with blue bars.
  • ...and 13 more figures