Fixed Point Diffusion Models

Xingjian Bai, Luke Melas-Kyriazi

TL;DR

The paper tackles the inefficiency of diffusion-based image generation by introducing Fixed Point Diffusion Models (FPDM), which embed an implicit fixed-point layer into the denoising network of a diffusion model and operate in latent space. Training uses Stochastic Jacobian-Free Backpropagation to backpropagate through a sequence of fixed-point solutions across timesteps, which substantially reduces parameter count and memory while maintaining or improving sample quality under constrained compute. FPDM introduces two sampling techniques, timestep smoothing and solution reuse, which allow compute to be reallocated across timesteps and speed up convergence of the fixed-point solver. On ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, FPDM uses 87% fewer parameters and 60% less training memory than DiT, and it produces higher-quality images when sampling time or compute is limited. The work also outlines limitations and promising directions, such as scaling to larger datasets and exploring adaptive allocation policies to further exploit the fixed-point framework.

Abstract

We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model, transforming the diffusion process into a sequence of closely-related fixed point problems. Combined with a new stochastic training method, this approach significantly reduces model size, reduces memory usage, and accelerates training. Moreover, it enables the development of two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments with state-of-the-art models on ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model, FPDM contains 87% fewer parameters, consumes 60% less memory during training, and improves image generation quality in situations where sampling computation or time is limited. Our code and pretrained models are available at https://lukemelas.github.io/fixed-point-diffusion-models.
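
To make the idea of reusing fixed point solutions between timesteps concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `toy_layer` is a hypothetical scalar contraction standing in for the implicit transformer layer, and the timestep schedule, tolerance, and function names are all assumptions chosen for illustration. The sketch solves one fixed point problem per timestep by simple iteration and compares a cold start (always from zero) with a warm start (from the previous timestep's solution).

```python
import math


def toy_layer(z: float, x: float, t: float) -> float:
    """Hypothetical contractive map; a stand-in for the implicit layer, not the paper's network."""
    return 0.5 * math.tanh(z) + x + 0.1 * t


def solve_fixed_point(x: float, t: float, z_init: float,
                      tol: float = 1e-8, max_iters: int = 100):
    """Iterate z <- toy_layer(z, x, t) until successive iterates differ by less than tol.

    Returns the approximate fixed point and the number of iterations used.
    """
    z = z_init
    for n in range(1, max_iters + 1):
        z_next = toy_layer(z, x, t)
        if abs(z_next - z) < tol:
            return z_next, n
        z = z_next
    return z, max_iters


def total_iterations(timesteps, x: float, reuse: bool) -> int:
    """Solve one fixed point problem per timestep and count the iterations spent in total."""
    z_prev, total = 0.0, 0
    for t in timesteps:
        z_init = z_prev if reuse else 0.0  # warm start vs. cold start
        z_prev, n = solve_fixed_point(x, t, z_init)
        total += n
    return total


# Assumed schedule: 11 closely spaced timesteps from 1.0 down to 0.0.
timesteps = [1.0 - 0.1 * i for i in range(11)]
cold_total = total_iterations(timesteps, x=1.0, reuse=False)
warm_total = total_iterations(timesteps, x=1.0, reuse=True)
print(f"cold start: {cold_total} iterations, warm start: {warm_total} iterations")
```

Because neighbouring timesteps define closely related problems, the previous solution is already near the next fixed point, so the warm-started solver needs fewer iterations in this toy setting. How large the saving is for the actual model depends on the network and sampler and is not something this sketch measures.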

Paper Structure (25 sections, 2 equations, 12 figures, 5 tables, 1 algorithm)

Figures (12)

  • Figure 1: Fixed Point Diffusion Model (FPDM) is a novel and highly efficient approach to image generation with diffusion models. FPDM integrates an implicit fixed point layer into a denoising diffusion model, converting the sampling process into a sequence of fixed point equations. Our model significantly decreases model size and memory usage while improving performance in settings with limited sampling time or computation. We compare our model, trained at a 256 $\times$ 256 resolution, against the state-of-the-art DiT (Peebles & Xie, 2022) on four datasets (FFHQ, CelebA-HQ, LSUN-Church, ImageNet) using compute equivalent to $20$ DiT sampling steps. FPDM (right) demonstrates enhanced image quality with 87% fewer parameters and 60% less memory during training.
  • Figure 2: The architecture of FPDM compared with DiT. FPDM keeps the first and last transformer blocks as pre- and post-processing layers and replaces the explicit layers in between with an implicit fixed point layer. Sampling from the full reverse diffusion process involves solving many of these fixed point layers in sequence, which enables the development of new techniques such as timestep smoothing (Section \ref{sec:methods_smoothing}) and solution reuse (Section \ref{sec:methods_reuse}).
  • Figure 3: Illustration of Transformer Block Forward Pass Allocation in FPDM and DiT. Since DiT has to perform full forward passes at each timestep, under limited compute, it can only denoise at a few timesteps with large gaps. FPDM achieves a more balanced distribution through smoothing, thereby reducing the discretization error. Additionally, FPDM offers the flexibility to adjust forward pass allocation per timestep with heuristics like Increasing and Decreasing. Refer to Section \ref{sec:methods_smoothing} for details.
  • Figure 4: Timestep smoothing significantly improves performance. Given the same amount of sampling compute (280 transformer blocks), FPDM enables us to allocate computation among more or fewer diffusion timesteps, creating a tradeoff between the number of fixed-point solving iterations per timestep and the number of timesteps in the diffusion process (see Section \ref{sec:methods_smoothing}). Here we explore the performance of our model on ImageNet with fixed point iterations ranging from 1 iteration (across 93 timesteps) to 68 iterations (across 4 timesteps). Each timestep also has 1 pre- and post-layer, so sampling with $k$ iterations utilizes $k+2$ blocks of compute per timestep. The circle and dashed lines show the performance of the baseline DiT-XL/2 model with 28 layers, which in our formulation corresponds to smoothing over 26 iterations. Although our model is slightly worse than DiT at 26 iterations, it significantly outperforms DiT when smoothed across more timesteps, demonstrating the effectiveness of timestep smoothing.
  • Figure 5: Qualitative Results for Smoothing Computation Across Timesteps. We show visual results of FPDM using different numbers of fixed point solving iterations, while keeping the total amount of sampling compute fixed (560 transformer blocks). Our method performs similarly to the baseline with 20 to 30 iterations per timestep and achieves superior generation quality with 4 to 8 iterations, as reflected quantitatively in Figure \ref{fig:iterations_timesteps_03}.
  • ...and 7 more figures
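
The compute accounting in the Figure 4 caption can be checked with a short sketch. It assumes that each timestep costs $k+2$ transformer blocks ($k$ fixed point iterations plus one pre-layer and one post-layer) and that the number of timesteps is the budget divided by that cost, rounded down; the rounding rule and the function name `timesteps_for_budget` are assumptions made for illustration, not taken from the paper.

```python
def timesteps_for_budget(total_blocks: int, iterations: int,
                         overhead_blocks: int = 2) -> int:
    """Whole timesteps affordable when each one costs `iterations + overhead_blocks` blocks.

    `overhead_blocks=2` models the one pre-layer and one post-layer per timestep.
    Flooring the quotient is an assumption of this sketch.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return total_blocks // (iterations + overhead_blocks)


budget = 280  # transformer blocks, the budget quoted in the Figure 4 caption
for k in (1, 26, 68):
    steps = timesteps_for_budget(budget, k)
    print(f"{k:>2} iterations per timestep -> {steps} timesteps")
```

With a budget of 280 blocks this gives 93 timesteps at 1 iteration and 4 timesteps at 68 iterations, matching the endpoints quoted in the caption; 26 iterations corresponds to 10 timesteps, the same per-timestep cost as one forward pass of the 28-layer DiT-XL/2 baseline.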