DiP: Taming Diffusion Models in Pixel Space
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
TL;DR
The paper tackles the high computational cost of pixel-space diffusion models by proposing DiP, a framework that runs a Diffusion Transformer on large patches to build global structure and couples it with a lightweight Patch Detailer Head that recovers fine local details. This end-to-end, VAE-free design achieves efficiency comparable to latent diffusion models while maintaining high fidelity, evidenced by an ImageNet 256×256 FID of $1.79$ and fast inference (≈$0.92$ s per image) with modest parameter overhead. Comprehensive ablations show that injecting local inductive bias via the Patch Detailer Head is key to performance, and that applying the head as a post-hoc refinement step is a simple, effective way to integrate it. The results establish DiP as a new state of the art on the efficiency-quality frontier for pixel-space diffusion, with potential extensions to text-to-image and video generation.
Abstract
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel-space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel-space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10$\times$ faster inference than previous methods while increasing the total number of parameters by only 0.3%, and attains an FID score of 1.79 on ImageNet 256$\times$256.
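
To make the efficiency argument concrete, the following back-of-the-envelope sketch counts transformer tokens for a 256×256 image under three configurations. The specific numbers (a pixel-space patch size of 16, a VAE downsampling factor of 8, a latent patch size of 2) are illustrative assumptions, not values stated in the abstract, and the helper names are hypothetical.

```python
def num_tokens(image_size: int, patch_size: int, downsample: int = 1) -> int:
    """Number of transformer tokens for a square image.

    `downsample` models a VAE's spatial compression factor
    (1 means the transformer sees raw pixels).
    """
    side = image_size // downsample
    if side % patch_size != 0:
        raise ValueError("patch_size must divide the (downsampled) image side")
    return (side // patch_size) ** 2


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost grows quadratically with sequence length."""
    return (tokens / baseline_tokens) ** 2


IMAGE_SIZE = 256

# Assumed configurations, for illustration only.
latent_tokens = num_tokens(IMAGE_SIZE, patch_size=2, downsample=8)  # LDM-style: 8x VAE, 2x2 latent patches
large_patch_tokens = num_tokens(IMAGE_SIZE, patch_size=16)          # pixel space, large 16x16 patches
small_patch_tokens = num_tokens(IMAGE_SIZE, patch_size=2)           # pixel space, small 2x2 patches

print(latent_tokens, large_patch_tokens, small_patch_tokens)  # 256 256 16384
print(relative_attention_cost(small_patch_tokens, large_patch_tokens))  # 4096.0
```

Under these assumptions, a pixel-space transformer with large patches processes the same number of tokens as a latent-space one, whereas small pixel patches would make attention thousands of times more expensive. The price of large patches is that each token must account for many pixel values, which is the gap the Patch Detailer Head is described as closing.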
