Dynamic Diffusion Transformer

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

TL;DR

This work proposes Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation, and introduces a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps.

Abstract

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer.
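The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the router shapes, the sigmoid gating, the 0.5 threshold, the sinusoidal timestep embedding, and every function name below are assumptions chosen only to show the idea. TDW is modeled as one on/off decision per attention head computed from the timestep alone, and SDT as one on/off decision per token computed from that token's features, with skipped tokens assumed to pass through the block unchanged.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer timestep (an assumed, common choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def head_mask(t, router_w, router_b, emb_dim, threshold=0.5):
    """Timestep-wise width (sketch): one keep/skip flag per head.

    The decision depends only on the timestep t, so it is shared by all tokens.
    router_w has one row of length emb_dim per head (hypothetical router).
    """
    emb = timestep_embedding(t, emb_dim)
    scores = [
        sigmoid(sum(w * e for w, e in zip(row, emb)) + b)
        for row, b in zip(router_w, router_b)
    ]
    return [s >= threshold for s in scores]


def token_mask(tokens, router_w, router_b, threshold=0.5):
    """Spatial-wise tokens (sketch): one keep/skip flag per token."""
    return [
        sigmoid(sum(w * x for w, x in zip(router_w, tok)) + router_b) >= threshold
        for tok in tokens
    ]


def dynamic_mlp(tokens, mask, mlp):
    """Run `mlp` only on kept tokens; skipped tokens are copied through."""
    return [mlp(tok) if keep else list(tok) for tok, keep in zip(tokens, mask)]


# Toy demo with random (untrained) routers: 4 heads, 8-dim timestep embedding.
random.seed(0)
num_heads, emb_dim = 4, 8
head_router_w = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(num_heads)]
head_router_b = [0.0] * num_heads
for t in (10, 500, 990):
    print(t, head_mask(t, head_router_w, head_router_b, emb_dim))
```

In this sketch the head mask can be computed once per timestep before any tokens are processed, whereas the token mask must be computed from the data at each step; the compute saved depends on how many heads and tokens the routers switch off.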

Paper Structure (67 sections, 6 equations, 15 figures, 17 tables)

Figures (15)

  • Figure 1: (a) The loss difference between DiT-S and DiT-XL across all diffusion timesteps ($T=1000$). The difference is slight at most timesteps. (b) Loss maps (normalized to the range [0, 1]) at different timesteps show that the noise in different patches varies in how difficult it is to predict. (c) Difference in the inference paradigm between the static DiT and the proposed DyDiT.
  • Figure 2: Overview of the proposed Dynamic Diffusion Transformer (DyDiT). It reduces the computational redundancy in DiT (Peebles & Xie, 2023) along both the timestep and spatial dimensions.
  • Figure 3: FLOPs-FID trade-off for S, B, and XL size models on ImageNet. For clarity, we omit the results of applying ToMe to DiT-B and DiT-XL, as it does not surpass random pruning.
  • Figure 4: Qualitative comparison of images generated by the original DiT, DiT pruned by magnitude, and DyDiT. All models are of "S" size. The FLOPs ratio $\lambda$ in DyDiT is set to 0.5.
  • Figure 5: Visualization of the dynamic architecture. Distinct markers indicate whether each head in an MHSA block is deactivated or activated, and whether each channel group in an MLP block is deactivated or activated. We conduct 250-step DDPM generation.
  • ...and 10 more figures