PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

TL;DR

PixelDiT proposes a single-stage, end-to-end pixel-space diffusion model built on a dual-level Transformer to separate global semantics from per-pixel texture refinement, thereby eliminating the autoencoder bottleneck. The patch-level DiT handles broad structure while a lightweight pixel-level PiT performs dense per-pixel updates guided by pixel-wise AdaLN and a token compaction mechanism to keep attention scalable. It achieves 1.61 gFID on ImageNet 256×256 and demonstrates megapixel text-to-image generation with GenEval 0.74 at 1024×1024, approaching state-of-the-art latent diffusion models while avoiding VAE reconstruction artifacts during editing. Ablation studies show the necessity of pixel-level modeling and token compaction for efficient training and high-fidelity textures. Overall, PixelDiT narrows the gap between pixel-space and latent-space diffusion and highlights practical viability for high-resolution pixel-space generation.
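
To see why per-pixel attention needs some form of token reduction, it helps to count tokens. The sketch below is plain arithmetic, not the paper's token compaction mechanism: it compares the number of patch tokens with the number of pixel tokens, and the number of query-key pairs that full self-attention would evaluate over each. The patch size of 16 is taken from the Figure 5 caption; the helper names `token_counts` and `attention_pairs` are hypothetical.

```python
def token_counts(height, width, patch_size):
    """Return (patch_tokens, pixel_tokens) for an image cut into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patch_tokens = (height // patch_size) * (width // patch_size)
    pixel_tokens = height * width  # one token per pixel location
    return patch_tokens, pixel_tokens


def attention_pairs(num_tokens):
    """Query-key pairs evaluated by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


for side in (256, 1024):
    patches, pixels = token_counts(side, side, patch_size=16)
    ratio = attention_pairs(pixels) // attention_pairs(patches)
    print(f"{side}x{side}: {patches} patch tokens, {pixels} pixel tokens, "
          f"{ratio}x more attention pairs at pixel level")
```

With a patch size of 16, each patch covers 256 pixels, so naive global attention over pixels would evaluate 256² = 65,536 times as many pairs as attention over patches, at either resolution.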

Abstract

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256×256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024×1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Paper Structure

This paper contains 37 sections, 5 equations, 18 figures, 11 tables.

Figures (18)

  • Figure 1: Visual results of PixelDiT on text-to-image generation and training-free image editing. Please zoom in for the details. Additional examples are provided in the Appendix.
  • Figure 2: Overview of PixelDiT: a dual-level, fully transformer-based diffusion architecture that operates directly in pixel space. The left figure shows the overall framework of PixelDiT, while the right figure illustrates the detailed structure of the PiT blocks.
  • Figure 3: AdaLN modulation strategies. ($\mathbb{A}$) A naive AdaLN broadcasts a global conditioning vector to all pixels. ($\mathbb{B}$) Patch-wise AdaLN expands semantic tokens to the $p^2$ pixels within each patch. ($\mathbb{C}$) Pixel-wise AdaLN applies an MLP to each semantic token to produce per-pixel scale, shift, and gating parameters, enabling fully context-aligned updates at every pixel.
  • Figure 4: Qualitative results on ImageNet $256 \times 256$ using PixelDiT-XL. We use a classifier-free guidance scale $\alpha_{\mathrm{cfg}} = 4.0$.
  • Figure 5: Convergence analysis of PixelDiT on ImageNet 256$\times$256. (a) gFID vs. training iterations for B, L, and XL models with varying patch sizes. (b) Comparison of B/L/XL models at a fixed patch size $p{=}16$.
  • ...and 13 more figures
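
The three modulation strategies named in the Figure 3 caption differ mainly in the shape of the scale, shift, and gate tensors they produce. The NumPy sketch below illustrates that difference under several assumptions that are not taken from the paper: toy dimensions, random weights, a single linear map standing in for the MLP, and the DiT-style update `x + gate * (LN(x) * (1 + scale) + shift)`. All names (`modulate`, `apply_adaln`, `W_global`, and so on) are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize over the channel axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def modulate(x, scale, shift, gate):
    # Assumed DiT-style convention: gated residual around a modulated LayerNorm.
    return x + gate * (layer_norm(x) * (1.0 + scale) + shift)


def apply_adaln(x, params):
    """params[..., k, :] holds scale (k=0), shift (k=1), and gate (k=2)."""
    scale, shift, gate = params[..., 0, :], params[..., 1, :], params[..., 2, :]
    return modulate(x, scale, shift, gate)


rng = np.random.default_rng(0)
N, p, D, C = 4, 2, 8, 16      # patches, patch size, pixel dim, semantic dim (toy values)
P = p * p                     # pixels per patch

x = rng.normal(size=(N, P, D))   # pixel tokens, grouped by patch
c = rng.normal(size=(C,))        # one global conditioning vector
s = rng.normal(size=(N, C))      # one semantic token per patch

# Random projections standing in for learned layers (assumption).
W_global = rng.normal(size=(C, 3 * D)) * 0.02
W_patch = rng.normal(size=(C, 3 * D)) * 0.02
W_pixel = rng.normal(size=(C, P * 3 * D)) * 0.02

# Strategy A: one parameter set, broadcast to every pixel in the image.
params_a = (c @ W_global).reshape(1, 1, 3, D)
# Strategy B: one parameter set per patch, shared by its p^2 pixels.
params_b = (s @ W_patch).reshape(N, 1, 3, D)
# Strategy C: each semantic token is expanded into p^2 distinct parameter sets.
params_c = (s @ W_pixel).reshape(N, P, 3, D)

out_a = apply_adaln(x, params_a)
out_b = apply_adaln(x, params_b)
out_c = apply_adaln(x, params_c)
print(out_a.shape, out_b.shape, out_c.shape)
```

All three outputs have the same shape as `x`; what changes is how finely the conditioning varies: per image, per patch, or per pixel.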