DiP: Taming Diffusion Models in Pixel Space

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai

TL;DR

The paper tackles the high computational cost of diffusion models by proposing DiP, a pixel-space diffusion framework that operates a Diffusion Transformer on large patches for global structure while coupling a lightweight Patch Detailer Head to recover fine local details. This end-to-end, VAE-free design achieves efficiency comparable to latent diffusion models yet maintains high fidelity, evidenced by an ImageNet 256×256 FID of $1.79$ and faster inference (≈$0.92$s per image) with modest parameter overhead. Comprehensive ablations show that injecting local inductive bias via the Patch Detailer Head is key to performance, and that post-hoc refinement offers a simple, effective integration. The results establish DiP as a new state-of-the-art on the efficiency-quality frontier for pixel-space diffusion, with potential extensions to text-to-image and video generation.

Abstract

Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel-space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel-space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10$\times$ faster inference than previous methods while increasing the total number of parameters by only 0.3%, and attains an FID score of 1.79 on ImageNet 256$\times$256.
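
The efficiency argument in the abstract can be illustrated with a rough token-count calculation. The sketch below is not taken from the paper: the patch sizes (2 for a small-patch pixel model, 16 for a large-patch pixel model) and the 8$\times$ VAE downsampling factor for the latent model are illustrative assumptions, and `num_tokens` is a hypothetical helper. It only shows why a transformer on large pixel patches can see the same sequence length as a typical latent model.

```python
def num_tokens(image_size: int, patch_size: int, downsample: int = 1) -> int:
    """Number of transformer tokens for a square image.

    `downsample` models a VAE that shrinks each spatial side before
    patchification (1 means the model works directly on pixels).
    All values passed below are illustrative assumptions.
    """
    side = image_size // downsample
    if side % patch_size != 0:
        raise ValueError("patch_size must divide the (downsampled) side length")
    return (side // patch_size) ** 2


# Assumed settings for a 256x256 image (not taken from the paper).
pixel_small_patch = num_tokens(256, patch_size=2)            # small-patch pixel model
pixel_large_patch = num_tokens(256, patch_size=16)           # large-patch pixel model
latent = num_tokens(256, patch_size=2, downsample=8)         # latent model with 8x VAE

# Self-attention cost grows quadratically with the number of tokens.
attention_cost_ratio = (pixel_small_patch / pixel_large_patch) ** 2

print(pixel_small_patch, pixel_large_patch, latent, attention_cost_ratio)
```

Under these assumptions the large-patch pixel model and the latent model process the same number of tokens, while the small-patch pixel model processes 64 times more. Large patches alone give up local detail, which is the gap the Patch Detailer Head is meant to close.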

Paper Structure

This paper contains 22 sections, 1 theorem, 36 equations, 30 figures, 4 tables.

Key Result

Theorem A.6

Assume that Assumptions ass:pdata, ass:decay, and ass:eidit hold. Consider using DiT and DiP, respectively, as the predictor for the diffusion generation task. The general near-optimal estimates $\hat{v}^{(s)}_{{\rm DiT}}$ and $\hat{v}^{(s)}_{{\rm DiP}}$ satisfy ... and ..., respectively, where ... The denoising operator $\mathbf{P}^{(s)}\hat{\mathbf{B}}\hat{\mathbf{M}}$ and $\mathbf{P}^{(s)} \mathbf{A}$...

Figures (30)

  • Figure 1: Comparison of vanilla latent diffusion model, vanilla pixel diffusion model and our method. Vanilla LDMs utilize VAEs to balance computational efficiency and generation quality. Vanilla pixel diffusion models use small patches to pursue detailed generation quality. Our method achieves high-quality generation while maintaining efficient end-to-end training in pixel space.
  • Figure 2: Our method achieves the best FID score with minimal computational cost. (Note: LDM latency includes the VAE. For methods marked with dashed lines, latency is our estimate based on the sampling method described in the corresponding paper; the true latency is likely higher than the marked values. The remaining methods were measured in the same hardware environment.)
  • Figure 3: Overfitting the DiT-only model using a single image in pixel space leads to poor detail reconstruction. Introducing a local inductive bias achieves better reconstruction and accelerates convergence. Please zoom in for details.
  • Figure 4: The t-SNE visualization of feature space. In the ImageNet validation set, 100 samples were randomly selected from each of the 10 classes for feature visualization. Features are extracted using DiT-only and our method, with each class shown in a distinct color.
  • Figure 5: The Patch Detailer Head with local inductive bias placed at different locations in the model. The results in the Analysis section show that all three placements offer gains compared to DiT-only.
  • ...and 25 more figures

Theorems & Definitions (6)

  • Definition A.1: Patch-level Input
  • Definition A.2: Effective Information
  • Theorem A.6
  • proof
  • Remark A.7
  • proof