
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu

TL;DR

LazyDiT is a cache-based lazy-learning framework that accelerates diffusion transformers by skipping redundant computations across diffusion steps. It builds on two observations: the lower bound of the similarity between outputs at consecutive steps is high, and that similarity can be approximated by a linear layer on the inputs (via Taylor expansion). Lazy learning layers, trained with a laziness-focused loss, decide when to reuse cached results. In experiments, LazyDiT outperforms the DDIM sampler on ImageNet-scale diffusion transformer models, and on mobile devices it achieves better generation quality than DDIM at similar latency. The work targets efficient diffusion generation on edge devices, and its ablation studies examine how penalty regulation balances speed against output quality.

Abstract

Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks, demonstrating superior performance and efficacy across various applications. These promising results come at the cost of slow inference, as each denoising step requires running the whole transformer model with a large number of parameters. In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. Furthermore, we show that the lower bound of similarity between outputs at consecutive steps is notably high, and that this similarity can be linearly approximated using the inputs. To validate these findings, we propose LazyDiT, a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations. Specifically, we incorporate lazy learning layers into the model, trained to maximize laziness, enabling dynamic skipping of redundant computations. Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency. Code: https://github.com/shawnricecake/lazydit
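
The skipping mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `LazyGate`, the function `lazy_forward`, the mean-pooling of tokens, the sigmoid score, and the `threshold` value are all assumptions made for illustration. The sketch only shows the general idea of a small linear layer on the input deciding whether a module's cached output from the previous step is reused.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LazyGate:
    """Hypothetical lazy layer: one linear map from the input to a skip score."""
    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.02, size=dim)
        self.b = 0.0

    def score(self, x):
        # x has shape (tokens, dim); mean-pool the tokens, then apply the linear layer.
        return float(sigmoid(x.mean(axis=0) @ self.w + self.b))

def lazy_forward(module, gate, x, cache, threshold=0.5):
    """Return (output, skipped); reuse the cache when the gate score exceeds the threshold."""
    if cache is not None and gate.score(x) > threshold:
        return cache, True
    return module(x), False

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(dim, dim))
module = lambda x: np.tanh(x @ W)  # stand-in for an MHSA or feedforward module

gate = LazyGate(dim, rng)
gate.b = 10.0  # bias chosen by hand so the gate prefers skipping in this demo

x0 = rng.normal(size=(4, dim))
out0, skipped0 = lazy_forward(module, gate, x0, cache=None)   # no cache yet: must compute
x1 = x0 + 0.01 * rng.normal(size=(4, dim))                    # next step, similar input
out1, skipped1 = lazy_forward(module, gate, x1, cache=out0)   # gate fires: cache reused
```

In this toy setup the first call always computes the module because no cache exists, and the second call returns the cached array unchanged. In a trained model the gate weights would be learned rather than set by hand.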

Paper Structure

This paper contains 50 sections, 12 theorems, 44 equations, 7 figures, 7 tables.

Key Result

Theorem 1

There exist time-variant and condition-variant scalings and shiftings such that the distance between two inputs at consecutive steps for MHSA or Feedforward is bounded.

Figures (7)

  • Figure 1: Images generated by DiT-XL/2 at 512$\times$512 and 256$\times$256 resolutions when lazily skipping 50% of the computation. The upper rows display results from the original model and the lower rows show the outputs of our method. Compared to the baseline, our method generates distinct lighting effects for the background and distinct colors, as shown in the dog and marmot examples, respectively.
  • Figure 2: Framework overview. We skip the computation of MHSA or Feedforward by reusing the cached output from the previous step.
  • Figure 3: Images generated by the DiT-XL/2 model at 256$\times$256 resolution on mobile. Images in the first and second rows are generated with 10 and 7 sampling steps, respectively. Images in the last row are generated with a 30% lazy ratio.
  • Figure 4: Visualization of the laziness in MHSA and Feedforward at each layer, generated with 20 DDIM steps on DiT-XL.
  • Figure 5: Upper figure: ablation of generation performance with different individual laziness applied to each module independently. Lower figure: ablation of generation performance with a varying lazy ratio for one module and a fixed lazy ratio for the other.
  • ...and 2 more figures
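
The captions above refer to a lazy ratio (for example 30% or 50%) and to per-layer laziness. A minimal sketch of how such numbers could be tallied from recorded skip decisions follows. It assumes that the lazy ratio means the fraction of module calls that were skipped, which the captions do not state explicitly; the function `lazy_ratios` and the mask layout `(steps, layers, 2)` are hypothetical.

```python
import numpy as np

def lazy_ratios(skip_mask):
    """skip_mask[t, l, m] is True when module m (0 = MHSA, 1 = Feedforward)
    of layer l was skipped at sampling step t."""
    skip_mask = np.asarray(skip_mask, dtype=bool)
    overall = float(skip_mask.mean())           # fraction of all module calls skipped
    per_layer_module = skip_mask.mean(axis=0)   # shape (layers, 2), averaged over steps
    return overall, per_layer_module

# Toy record: 4 sampling steps, 2 layers, 2 modules per layer.
mask = np.zeros((4, 2, 2), dtype=bool)
mask[1:, 0, 0] = True   # layer 0 MHSA skipped at every step after the first
mask[2, 1, 1] = True    # layer 1 Feedforward skipped once

overall, per_layer_module = lazy_ratios(mask)
print(overall)                 # 0.25
print(per_layer_module[0, 0])  # 0.75
```

The first step is never marked as skipped in this example, since no cached result exists before it.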

Theorems & Definitions (23)

  • Theorem 1: Scaling and shifting, informal version of Theorem 13 in Appendix C.2
  • Theorem 2: Similarity lower bound, informal version of Theorem 18 in Appendix C.4
  • Theorem 3: Linear layer approximation, informal version of Theorem 19 in Appendix C.5
  • proof
  • Definition 9: Self-attention module
  • Definition 10: Feedforward module
  • Lemma 11: Scaling and shifting for one row
  • proof
  • Lemma 12: Scaling and shifting
  • proof
  • ...and 13 more
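
To make the notion of output similarity between consecutive steps (Theorem 2) concrete, here is a minimal sketch. It assumes that similarity is measured as cosine similarity over flattened module outputs; this listing does not state the metric, so that choice, the function name `cosine_similarity`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two arrays, treated as flat vectors."""
    a = np.ravel(a)
    b = np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

rng = np.random.default_rng(0)
y_prev = rng.normal(size=(4, 8))                   # stand-in for a module output at step t-1
y_curr = y_prev + 0.05 * rng.normal(size=(4, 8))   # step t: a small perturbation of step t-1
sim = cosine_similarity(y_prev, y_curr)
```

When consecutive outputs differ only slightly, as in this synthetic case, `sim` is close to 1, which is the situation in which reusing a cached output loses little.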