D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao

TL;DR

This paper addresses the fixed-region compression limitation in diffusion-based image generation by introducing a two-stage, information-density-aware framework. It combines Dynamic VAE (DVAE), which encodes image regions at variable downsampling rates based on local content, with Dynamic Diffusion Transformer (D$^2$iT), which predicts multi-grained noise through a Dynamic Grain Transformer and a Dynamic Content Transformer. The approach unifies global structural coherence and local detail by coupling coarse noise prediction with region-specific fine corrections and introduces a multi-grained loss to optimize across granularities. Empirical results on FFHQ and ImageNet demonstrate substantial quality gains with competitive efficiency, validating the effectiveness of dynamic, region-aware diffusion for high-fidelity image generation.

Abstract

Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. Large compression limits local realism, while small compression increases computational complexity and compromises global consistency; either choice ultimately degrades the quality of generated images. To address these limitations, we propose dynamically compressing different image regions according to their importance, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE), in the first stage, employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT), in the second stage, generates images by predicting multi-grained noise, consisting of coarse-grained noise (fewer latent codes in smooth regions) and fine-grained noise (more latent codes in detailed regions), through a novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. Combining a rough prediction of noise with correction of detailed regions unifies global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at https://github.com/jiawn-creator/Dynamic-DiT.
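
To make the first-stage idea concrete, here is a minimal sketch of assigning a granularity to each image region from its information density. It is not the DVAE implementation: it assumes Shannon entropy of pixel intensities as a stand-in for information density, and the names `region_entropy`, `grain_map`, `region`, and `threshold` are hypothetical. Regions are labelled '1' (coarse) or '2' (fine).

```python
import math
from collections import Counter


def region_entropy(pixels):
    """Shannon entropy, in bits, of a flat list of integer pixel intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def grain_map(image, region=4, threshold=1.0):
    """Label each region x region block: 1 = coarse, 2 = fine.

    Assumption: a block counts as "detailed" when its entropy exceeds
    `threshold`; the paper's actual criterion may differ.
    """
    height, width = len(image), len(image[0])
    grains = []
    for top in range(0, height, region):
        row = []
        for left in range(0, width, region):
            block = [
                image[y][x]
                for y in range(top, min(top + region, height))
                for x in range(left, min(left + region, width))
            ]
            row.append(2 if region_entropy(block) > threshold else 1)
        grains.append(row)
    return grains


# Toy 8x8 image: the left half is flat, the right half is textured.
image = [[0] * 4 + [(x * 37 + y * 11) % 256 for x in range(4)] for y in range(8)]
print(grain_map(image))
```

In this toy case the flat left blocks come out coarse and the textured right blocks come out fine, so smooth regions would receive fewer latent codes than detailed ones.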

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of our motivation. Compression here refers to the VAE + Patchify operation. (a) Existing fixed-compression diffusion transformers (DiT) ignore information density. Fixed large compression limits local realism, because a few tokens cannot represent rich information accurately enough to recover it, whereas fixed small compression limits global consistency and raises computational complexity, because of the burden of global modeling across patched latents. Samples in (a) are obtained from Peebles and Xie (2023). (b) Our Dynamic Diffusion Transformer (D$^2$iT) adopts a dynamic compression strategy and adds multi-grained noise based on information density, achieving unified global consistency and local realism.
  • Figure 2: Overview of our proposed two-stage framework. (1) Stage 1: DVAE dynamically assigns codes of different granularity to each image region through the Hierarchical Encoder and the Dynamic Grained Coding (DGC) module. (2) Stage 2: D$^2$iT consists of a Dynamic Grain Transformer and a Dynamic Content Transformer, which model the spatial granularity information and the content information, respectively. We present the network with two granularities. The grain map uses '1' to denote coarse-grained regions and '2' for fine-grained regions.
  • Figure 3: Qualitative results of our unconditional generation on FFHQ. In the grain map, red blocks represent fine-grained regions, while blue blocks indicate coarse-grained regions.
  • Figure 4: Qualitative results of D$^2$iT-XL on ImageNet. The grain maps are generated by the Dynamic Grain Transformer based on class labels, and the images are generated by the Dynamic Content Transformer based on class labels and grain maps.
  • Figure 5: Curves of reconstruction quality (rFID) and generation quality (FID) on FFHQ for different grain ratios.
  • ...and 1 more figure
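
The grain-map convention from Figure 2 ('1' for coarse, '2' for fine) and the multi-grained noise mentioned in Figure 1(b) can be illustrated with a short sketch. This is an assumption about one plausible construction, not the authors' code: each coarse region shares a single Gaussian sample across its `factor` x `factor` cells on the fine grid, while each fine region draws an independent sample per cell. The function name `multi_grained_noise` and its parameters are hypothetical.

```python
import random


def multi_grained_noise(grains, factor=2, seed=0):
    """Build a noise grid on the fine lattice from a grain map.

    Assumption: coarse regions (label 1) replicate one sample over their
    factor x factor cells; fine regions (label 2) sample every cell.
    """
    rng = random.Random(seed)
    rows, cols = len(grains), len(grains[0])
    noise = [[0.0] * (cols * factor) for _ in range(rows * factor)]
    for i in range(rows):
        for j in range(cols):
            if grains[i][j] == 2:
                # Fine-grained region: one independent sample per cell.
                for di in range(factor):
                    for dj in range(factor):
                        noise[i * factor + di][j * factor + dj] = rng.gauss(0.0, 1.0)
            else:
                # Coarse-grained region: a single sample shared by all cells.
                shared = rng.gauss(0.0, 1.0)
                for di in range(factor):
                    for dj in range(factor):
                        noise[i * factor + di][j * factor + dj] = shared
    return noise


grains = [[1, 2], [1, 2]]
for row in multi_grained_noise(grains):
    print(["%+.2f" % value for value in row])
```

Under these assumptions the coarse blocks on the left show one repeated value each, and the fine blocks on the right vary cell by cell, which mirrors the idea of spending more degrees of freedom where the image is detailed.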