DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Wangbo Zhao; Yizeng Han; Jiasheng Tang; Kai Wang; Hao Luo; Yibing Song; Gao Huang; Fan Wang; Yang You

TL;DR

This work addresses the inefficiency of diffusion transformers, tracing it to a static inference paradigm that spends redundant computation on certain diffusion timesteps and spatial regions. It introduces the Dynamic Diffusion Transformer (DyDiT), which adapts computation through timestep-wise dynamic width (TDW) and spatial-wise dynamic token (SDT), and extends it to DyDiT++ with support for flow matching, video generation, and text-to-image generation, plus a parameter-efficient training approach, timestep-based dynamic LoRA (TD-LoRA). Across DiT, SiT, Latte, and FLUX, DyDiT++ achieves substantial FLOPs reductions (51% on DiT-XL) with competitive or improved generation quality (FID of 2.07 on ImageNet) and hardware speedups of up to 1.73x. The method is broadly applicable, compatible with efficient samplers and caching, and trainable with PEFT, enabling faster and cheaper diffusion-based generation at high resolutions and across modalities.
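To make the two mechanisms named above more concrete, here is a minimal sketch of how they could be expressed as gating decisions. This summary does not describe the routers, so everything here is an assumption for illustration: the function names `head_mask` and `token_mask`, the linear-plus-sigmoid router, and the 0.5 threshold are hypothetical and are not taken from the released code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def head_mask(t_emb, router_w, threshold=0.5):
    """Timestep-wise dynamic width (illustrative).

    Produces one keep/skip decision per attention head from the timestep
    embedding alone, so the active width depends only on the timestep.
    `router_w` holds one weight row per head (hypothetical linear router).
    """
    scores = [sigmoid(sum(w * e for w, e in zip(row, t_emb))) for row in router_w]
    return [s > threshold for s in scores]


def token_mask(tokens, router_w, threshold=0.5):
    """Spatial-wise dynamic token (illustrative).

    Produces one keep/skip decision per token from that token's features,
    so computation can differ across spatial regions.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(router_w, tok))) > threshold
        for tok in tokens
    ]


# Toy inputs: a 2-dim timestep embedding, 2 heads, and 3 one-dim tokens.
heads_on = head_mask([1.0, 0.0], [[2.0, 0.0], [-2.0, 0.0]])
tokens_on = token_mask([[1.0], [-1.0], [0.5]], [3.0])
print(heads_on)   # which heads would run at this timestep
print(tokens_on)  # which tokens would be processed in this block
```

The point of the sketch is the difference in conditioning: the width decision depends only on the timestep, while the token decision depends on per-token features.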

Abstract

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation. Second, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and broaden access to the technology, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with fewer than 3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yields a 1.73x realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
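The relationship between the reported 51% FLOPs reduction and the reported 1.73x speedup can be checked with a short calculation. The sketch below assumes, as a simplification, that latency would scale linearly with FLOPs in the ideal case; the helper name `ideal_speedup` is hypothetical. Under that assumption the measured speedup falls somewhat short of the ideal, which is expected on real hardware since wall-clock time is not determined by FLOPs alone.

```python
def ideal_speedup(flops_reduction):
    """Speedup if latency were exactly proportional to remaining FLOPs (an assumption)."""
    return 1.0 / (1.0 - flops_reduction)


# Figures reported in the abstract for DiT-XL.
reported_flops_reduction = 0.51
reported_speedup = 1.73

ideal = ideal_speedup(reported_flops_reduction)
fraction_realized = reported_speedup / ideal

print(f"ideal speedup: {ideal:.2f}x")               # ideal speedup: 2.04x
print(f"fraction realized: {fraction_realized:.2f}")  # fraction realized: 0.85
```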
