f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Miguel Angel Bautista, Josh Susskind

TL;DR

f-DM introduces a multi-stage diffusion framework that enables progressive signal transformations within diffusion models. By integrating a sequence of deterministic transformations (downsampling, blurring, or learned encoders/decoders) with stage-wise forward diffusion, interpolation, and boundary-respecting noise rescaling, it achieves efficient, semantically interpretable generation while preserving a single diffusion process. The approach supports unconditional generation and conditional tasks such as super-resolution and deblurring, and demonstrates competitive or superior results on FFHQ, AFHQ, LSUN, and ImageNet relative to baselines and specialized variants. Ablation studies show the importance of interpolation and resolution-aware SNR rescaling, and the method enables latent-space manipulation and direct diffusion over transformed signals without training separate cascades.

Abstract

Diffusion models (DMs) have recently emerged as state-of-the-art tools for generative modeling in various domains. Standard DMs can be viewed as an instantiation of hierarchical variational autoencoders (VAEs) where the latent variables are inferred from input-centered Gaussian distributions with fixed scales and variances. Unlike VAEs, this formulation prevents DMs from changing the latent spaces and learning abstract representations. In this work, we propose f-DM, a generalized family of DMs which allows progressive signal transformation. More precisely, we extend DMs to incorporate a set of (hand-designed or learned) transformations, where the transformed input is the mean of each diffusion step. We propose a generalized formulation and derive the corresponding de-noising objective with a modified sampling algorithm. As a demonstration, we apply f-DM in image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations based on the encoder of pretrained VAEs. In addition, we identify the importance of adjusting the noise levels whenever the signal is sub-sampled and propose a simple rescaling recipe. f-DM can produce high-quality samples on standard image generation benchmarks like FFHQ, AFHQ, LSUN, and ImageNet with better efficiency and semantic interpretation.
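
The abstract's remark about adjusting noise levels when the signal is sub-sampled can be illustrated with a toy sketch. It is not the paper's implementation: the helper names (`avg_pool`, `diffuse`, `flat_std`), the constant toy image, the values of `alpha` and `sigma`, and the choice of rescaling `sigma` by `1 / s` are all assumptions made for illustration. The sketch only shows a standard statistical fact: averaging `s x s` independent Gaussian noise samples divides the noise standard deviation by `s`, so applying the same `sigma` directly at low resolution yields a noisier signal than pooling a diffused high-resolution image.

```python
import random
import statistics


def avg_pool(img, s):
    """Average-pool a square 2D list over non-overlapping s x s windows."""
    n = len(img) // s
    return [
        [
            sum(img[i * s + di][j * s + dj] for di in range(s) for dj in range(s))
            / (s * s)
            for j in range(n)
        ]
        for i in range(n)
    ]


def diffuse(signal, alpha, sigma, rng):
    """Sample z ~ N(alpha * signal, sigma^2 I); the (transformed) signal is the mean."""
    return [[alpha * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in signal]


def flat_std(img):
    """Population standard deviation over all pixels."""
    return statistics.pstdev([v for row in img for v in row])


rng = random.Random(0)
size, s = 128, 2          # hypothetical image size and down-sampling factor
alpha, sigma = 0.8, 0.6   # hypothetical noise level with alpha^2 + sigma^2 = 1

# Constant toy "image": every pixel is 1.0, so all measured variation is noise.
x = [[1.0] * size for _ in range(size)]

# (a) Diffuse at full resolution, then pool: noise std drops to about sigma / s.
pooled_noise_std = flat_std(avg_pool(diffuse(x, alpha, sigma, rng), s))

# (b) Pool first, then diffuse with the same sigma: noise std stays about sigma.
naive_noise_std = flat_std(diffuse(avg_pool(x, s), alpha, sigma, rng))

# (c) Pool first, then diffuse with sigma / s (an assumed, illustrative rescaling):
#     the noise std matches case (a).
rescaled_noise_std = flat_std(diffuse(avg_pool(x, s), alpha, sigma / s, rng))

print(f"pooled after diffusion : {pooled_noise_std:.3f}")
print(f"same sigma at low res  : {naive_noise_std:.3f}")
print(f"rescaled sigma at low  : {rescaled_noise_std:.3f}")
```

Cases (a) and (b) differ by roughly a factor of `s` in noise standard deviation, which is the kind of mismatch a resolution-aware rescaling is meant to remove. Case (c) is only one simple way to close the gap in this toy setting; the paper's own recipe may differ in form.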

Paper Structure

This paper contains 56 sections, 18 equations, 21 figures, 3 tables, and 1 algorithm.

Figures (21)

  • Figure 1: Visualization of reverse diffusion from $f$-DMs with various signal transformations. ${\bm{x}}_t$ is the denoised output, and ${\bm{z}}_s$ is the input to the next diffusion step. We plot the first three channels of VQVAE latent variables. Low-resolution images are resized to $256^2$ for ease of visualization.
  • Figure 2: (a) standard DMs; (b) a bottom-up hierarchical VAE; (c) our proposed $f$-DM.
  • Figure 3: Left: an illustration of the proposed SNR computation for different sampling rates; Right: a comparison of progressive down-sampling with and without rescaling the noise level. Without noise rescaling, the diffused images at low resolution quickly become too noisy to distinguish the underlying signal.
  • Figure 4: An illustration of the training pipeline.
  • Figure 5: $\uparrow$ Random samples from $f$-DM-DS trained on various datasets; $\downarrow$ Comparison of $f$-DMs and the corresponding baselines under various transformations. Best viewed when zoomed in. All faces presented are synthesized by the models, and are not real identities.
  • ...and 16 more figures