Elastic Diffusion Transformer

Jiangshan Wang, Zeqiang Lai, Jiarui Chen, Jiayi Guo, Hang Guo, Xiu Li, Xiangyu Yue, Chunchao Guo

TL;DR

E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent, and introduces a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner.

Abstract

Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT that improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and that this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image generation (Qwen-Image and FLUX) and 3D asset generation (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to approximately $2\times$ speedup with negligible loss in generation quality. Code will be available at https://github.com/wangjiangshan0725/Elastic-DiT.
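
The abstract does not spell out how router predictions drive the block-level feature cache, so the following is only a minimal sketch of one possible reading: when a hypothetical router score marks a block as unimportant for the current input, the block's residual from an earlier denoising step is reused instead of recomputed. The names `BlockCache`, `run_block`, `router_score`, and `reuse_threshold` are illustrative assumptions and do not come from the paper or its code.

```python
class BlockCache:
    """Stores the most recent residual (output minus input) of each block.

    Hypothetical illustration of router-gated block caching; this is an
    assumed scheme, not the paper's implementation.
    """

    def __init__(self):
        self._residuals = {}

    def run_block(self, block_id, block_fn, x, router_score, reuse_threshold=0.5):
        """Run one block, or reuse its cached residual.

        router_score is an assumed per-block score in [0, 1]; a lower value
        means the router considers the block less important for this input.
        Returns (output, reused_flag).
        """
        cached = self._residuals.get(block_id)
        if cached is not None and router_score < reuse_threshold:
            # Reuse the residual computed at an earlier denoising step.
            return [a + r for a, r in zip(x, cached)], True
        # Otherwise compute the block and refresh the cached residual.
        out = block_fn(x)
        self._residuals[block_id] = [o - a for o, a in zip(out, x)]
        return out, False
```

In a sampling loop, `run_block` would be called once per block per timestep. Because the cache only stores and replays activations, this kind of reuse needs no additional training, which matches the training-free property the abstract claims.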

Paper Structure (15 sections, 19 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Performance of Elastic Diffusion Transformer (E-DiT) across diverse generation foundation models and modalities.
  • Figure 2: Sample-dependent sparsity in the generation process. We use Qwen-Image [Qwen-image] to illustrate our observations. (a) Images generated after removing different subsets of DiT blocks from Qwen-Image, showing that block importance varies across samples. (b) Results obtained by skipping selected denoising timesteps using a timestep-wise feature caching strategy [liu2025timestep], demonstrating content-dependent sensitivity to timestep removal. (c) Comparison between images generated by the Qwen-Image base model (20B) and a pruned variant (10B) [ma2025pluggable], highlighting that computational requirements vary with sample difficulty.
  • Figure 3: Overall pipeline of Elastic Diffusion Transformer (E-DiT). (a) The architecture of the router, which predicts $p_g$ and $p_w$, indicating whether the block can be skipped and the width of the MLP within the block, respectively. (b) The overall structure of E-DiT, where each transformer block is equipped with a router. (c) The structure of the transformer block within E-DiT, where the width of the MLP is adaptively reduced according to the router's prediction.
  • Figure 4: Visual comparisons between E-DiT-turbo and open-source baselines based on Qwen-Image.
  • Figure 5: Visual comparison of Hunyuan3D-3.0 without and with E-DiT.
  • ...and 2 more figures
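
The router and width-adaptive MLP described in the Figure 3 caption can be illustrated with a toy sketch. Several assumptions are made here for illustration only: $p_g$ is treated as the probability of executing the block, $p_w$ as the fraction of MLP hidden units kept, width reduction is done by keeping the first $k$ hidden units, the router is a single linear layer, and attention is omitted. None of these details are confirmed by the text above, and the class name `ElasticBlock` is hypothetical.

```python
import math
import random


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ElasticBlock:
    """Toy block: a linear router plus a two-layer MLP with a residual.

    Hypothetical illustration only; not the paper's implementation.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = init(hidden, dim)    # dim -> hidden
        self.w_out = init(dim, hidden)   # hidden -> dim
        self.router = init(2, dim)       # dim -> (gate logit, width logit)
        self.hidden = hidden

    def route(self, x):
        """Return (p_g, p_w) under the assumed meanings stated above."""
        gate_logit, width_logit = matvec(self.router, x)
        return sigmoid(gate_logit), sigmoid(width_logit)

    def forward(self, x, gate_threshold=0.5):
        """Return (output, number of hidden units actually used)."""
        p_g, p_w = self.route(x)
        if p_g < gate_threshold:
            return list(x), 0  # block skipped: identity mapping
        # Assumed slicing scheme: keep only the first k hidden units.
        k = max(1, math.ceil(p_w * self.hidden))
        h = [max(0.0, v) for v in matvec(self.w_in[:k], x)]  # ReLU
        out = matvec([row[:k] for row in self.w_out], h)
        return [a + b for a, b in zip(x, out)], k
```

Calling `ElasticBlock(dim=4, hidden=8).forward(x)` on a length-4 list returns the block output together with the number of hidden units used, so the per-sample compute can be inspected directly. A hard threshold on $p_g$ is used here for simplicity; how the actual routers are trained and discretized is not described in this excerpt.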