DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You

TL;DR

This work targets the inefficiency of diffusion transformers, which it traces to static inference: the same amount of computation is spent at every diffusion timestep and in every spatial region, even where it is redundant. It introduces the Dynamic Diffusion Transformer (DyDiT), which adapts computation through timestep-wise dynamic width (TDW) and spatial-wise dynamic token (SDT), and extends it to DyDiT++ with support for flow matching, video generation, and text-to-image generation, plus a parameter-efficient training approach (TD-LoRA). Across DiT, SiT, Latte, and FLUX, DyDiT++ achieves substantial FLOPs reductions (e.g., 51% on DiT-XL) with competitive or improved generation quality (an FID of 2.07 on ImageNet) and hardware speedups of up to 1.73x. The method is broadly applicable, is compatible with efficient samplers and caching, and supports parameter-efficient fine-tuning (PEFT), enabling faster, cheaper diffusion-based generation at high resolutions and across modalities.

Abstract

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73x realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
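
The two headline numbers in the abstract (a 51% FLOPs reduction and a 1.73x measured speedup) can be related with a little arithmetic. The sketch below assumes, as an idealisation that is not claimed in the paper, that runtime scales linearly with FLOPs; the function name `ideal_speedup` is hypothetical and only serves the illustration.

```python
def ideal_speedup(flops_reduction: float) -> float:
    """Speedup if runtime scaled exactly with FLOPs (an idealisation)."""
    if not 0.0 <= flops_reduction < 1.0:
        raise ValueError("flops_reduction must be in [0, 1)")
    return 1.0 / (1.0 - flops_reduction)


# Figures quoted in the abstract for DiT-XL.
REPORTED_FLOPS_REDUCTION = 0.51
REPORTED_SPEEDUP = 1.73

ideal = ideal_speedup(REPORTED_FLOPS_REDUCTION)
realized_fraction = REPORTED_SPEEDUP / ideal

print(f"ideal speedup under the linear-FLOPs assumption: {ideal:.2f}x")
print(f"fraction of that bound realized on hardware: {realized_fraction:.0%}")
```

Under this assumption the upper bound is about 2.04x, so the reported 1.73x corresponds to roughly 85% of it. A gap between the theoretical and the measured figure is expected for dynamic architectures in general, since routing and token gathering add work that a FLOPs count does not capture.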

Paper Structure

This paper contains 119 sections, 17 equations, 30 figures, 24 tables.

Figures (30)

  • Figure 1: The core idea of DyDiT.
  • Figure 2: (a) The loss difference between DiT-S and DiT-XL is slight at most timesteps. (b) The loss maps (normalized within [0, 1]) show that the noise in different spatial locations has varying difficulty levels to predict. (c) The loss paradigm of the flow matching-based method, SiT (Ma et al., 2024). (d) The loss paradigm of Latte (Ma et al., 2024) with 16 frames sampled from $t=600$.
  • Figure 3: Overview of the proposed dynamic diffusion transformer (DyDiT).
  • Figure 4: Implementation of TDW and SDT in a FLUX SingleBlock. The output has a size of $\tilde{N} \times C$. It can then be scattered and added to the input $\mathbf{X} \in \mathbb{R}^{N \times C}$. We omit this scatter-add operation for brevity. Note that the width of $\text{FC2}$ is determined by both $\operatorname{R}_{\text{head}}$ and $\operatorname{R}_{\text{channel}}$.
  • Figure 5: (a) Comparison between the original LoRA and the proposed TD-LoRA. We introduce $M$ expert matrices to replace $\mathbf{B}$ in the original LoRA. (b) Fine-tuning specific parameters in DyDiT$_{\text{PEFT}}$.
  • ...and 25 more figures
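
The Figure 5 caption states that TD-LoRA replaces the single matrix $\mathbf{B}$ of the original LoRA with $M$ expert matrices. The sketch below illustrates one way such a timestep-conditioned update could look. Only the replacement of $\mathbf{B}$ by $M$ experts comes from the caption; the routing rule (a softmax over a linear projection of a timestep embedding), the shapes, and all names (`td_lora_delta`, `router`, `t_emb`) are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def td_lora_delta(a, experts, router, t_emb):
    """Return (delta_w, weights) for one timestep embedding.

    a:       rank x d_in matrix, shared across timesteps (LoRA's A)
    experts: M matrices of shape d_out x rank (replacing LoRA's single B)
    router:  M x d_t matrix; ASSUMED linear router over the timestep embedding
    t_emb:   timestep embedding of length d_t
    """
    scores = [sum(r * t for r, t in zip(row, t_emb)) for row in router]
    weights = softmax(scores)
    rows, cols = len(experts[0]), len(experts[0][0])
    # Timestep-specific B: weighted sum of the expert matrices.
    b_t = [
        [sum(w * e[i][j] for w, e in zip(weights, experts)) for j in range(cols)]
        for i in range(rows)
    ]
    return matmul(b_t, a), weights


random.seed(0)
d_in, d_out, rank, num_experts, d_t = 4, 3, 2, 3, 5
a = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
# Experts start at zero, mirroring the usual LoRA initialisation of B.
experts = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
router = [[random.gauss(0, 0.02) for _ in range(d_t)] for _ in range(num_experts)]
t_emb = [random.gauss(0, 1) for _ in range(d_t)]

delta, weights = td_lora_delta(a, experts, router, t_emb)
```

With zero-initialised experts the update `delta` is zero, so the adapted layer starts out identical to the frozen one, as in standard LoRA. The adapted weight would then be the frozen weight plus `delta`, recomputed whenever the timestep changes.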