LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

TL;DR

A LEarnable Stage-Aware (LESA) predictor framework based on two-stage training that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting and state-of-the-art performance on both text-to-image and text-to-video synthesis.

Abstract

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often degrading quality and failing to stay consistent with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show that our method achieves significant acceleration while maintaining high-fidelity generation: 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
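To make the caching idea in the abstract concrete, the following is a minimal sketch of the control flow only, not the authors' implementation. It assumes a fixed refresh interval, equal-width stages, and a history buffer that also stores predicted features; the names `stage_index`, `run_cached_sampling`, `full_forward`, and `predictors` are hypothetical, and the extrapolating predictor in the usage lines is a placeholder rather than a learned one.

```python
from collections import deque

import numpy as np


def stage_index(step, num_steps, num_stages=3):
    """Map a denoising step to a stage index.

    Equal-width stages are an assumption made for illustration.
    """
    return min(step * num_stages // num_steps, num_stages - 1)


def run_cached_sampling(full_forward, predictors, num_steps, interval, history=2):
    """Alternate between full computation and feature prediction.

    full_forward(step) stands in for the expensive DiT forward pass.
    predictors[s](past_features, step) stands in for the stage-s predictor.
    """
    cache = deque(maxlen=history)  # most recent features, oldest first
    outputs, full_calls = [], 0
    for step in range(num_steps):
        if step % interval == 0 or len(cache) < history:
            # Refresh step (or not enough history yet): run the full model.
            feat = full_forward(step)
            full_calls += 1
        else:
            # Skipped step: forecast the feature with the stage's predictor.
            predictor = predictors[stage_index(step, num_steps)]
            feat = predictor(list(cache), step)
        cache.append(feat)
        outputs.append(feat)
    return outputs, full_calls


# Toy usage: the "model" output grows linearly with the step, and every
# stage uses the same linear-extrapolation placeholder as its predictor.
def toy_forward(step):
    return np.full(4, float(step))


def extrapolate(past, step):
    return 2 * past[-1] - past[-2]


outputs, full_calls = run_cached_sampling(
    toy_forward, [extrapolate] * 3, num_steps=10, interval=5
)
```

In this toy run only the refresh steps and the initial warm-up step call `toy_forward`; the remaining steps are filled in by the placeholder predictor.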

Paper Structure (35 sections, 14 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Images generated by Qwen-Image using LESA with a 6.25$\times$ acceleration.
  • Figure 2: Cosine similarity curves and PCA-projected trajectories show that feature evolution differs markedly across diffusion models, with non-smooth, stage-dependent dynamics that challenge the common assumption of simple or continuous temporal change.
  • Figure 3: Overview of the training-based stage-aware learnable predictor framework. (a) The training process of LESA uses outputs from the previous $K$ steps as input and updates the predictor with the DiT activation output. (b) The inference process with the learned predictor skips part of the DiT computation by predicting features. (c) The predictor architecture processes features from previous steps with a linear projection and processes the timestep with a KAN module; the two outputs are multiplied to generate the feature at step $t$. (d) Stage-aware denoising divides the trajectory into three stages and assigns a dedicated predictor to each stage.
  • Figure 4: Comparison of LESA and TaylorSeer under different speedup ratios. LESA better preserves quality and perceptual metrics on Qwen-Image and FLUX.
  • Figure 5: Comparison of images from the sampling process and the final images. On FLUX, LESA achieves a significant speedup ratio with excellent spatial organization and accurate color representation.
  • ...and 2 more figures
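
The predictor described in the Figure 3 caption (panels c and d) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the paper's architecture: the KAN module is reduced to one learnable univariate function per channel built from Gaussian radial basis functions (spline bases are more common in KANs), the previous $K$ features are concatenated before the linear projection, timesteps are normalized to [0, 1], and the three stages are split at 1/3 and 2/3. The class names `ToyStagePredictor` and `StageAwarePredictor` and all shapes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


class ToyStagePredictor:
    """Linear projection of past features times a timestep-dependent gate."""

    def __init__(self, k_steps, dim, num_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        # Linear projection over the K concatenated past features.
        self.proj = rng.normal(scale=0.02, size=(k_steps * dim, dim))
        # KAN-like gate: per-channel univariate function of t expressed
        # as a weighted sum of fixed Gaussian bumps (an assumption).
        self.centers = np.linspace(0.0, 1.0, num_basis)
        self.width = 1.0 / num_basis
        self.coef = rng.normal(scale=0.1, size=(num_basis, dim))

    def timestep_gate(self, t):
        basis = np.exp(-(((t - self.centers) / self.width) ** 2))  # (num_basis,)
        return basis @ self.coef  # (dim,)

    def __call__(self, past_feats, t):
        # past_feats has shape (k_steps, tokens, dim).
        k, n, d = past_feats.shape
        stacked = past_feats.transpose(1, 0, 2).reshape(n, k * d)
        # Multiply the projected features by the gate, channel by channel.
        return (stacked @ self.proj) * self.timestep_gate(t)


class StageAwarePredictor:
    """Route each normalized timestep to the predictor of its stage."""

    def __init__(self, k_steps, dim, boundaries=(1 / 3, 2 / 3)):
        self.boundaries = np.asarray(boundaries)
        self.experts = [
            ToyStagePredictor(k_steps, dim, seed=s)
            for s in range(len(boundaries) + 1)
        ]

    def stage_of(self, t):
        return int(np.searchsorted(self.boundaries, t, side="right"))

    def __call__(self, past_feats, t):
        return self.experts[self.stage_of(t)](past_feats, t)


model = StageAwarePredictor(k_steps=3, dim=8)
past = np.ones((3, 5, 8))    # K=3 previous steps, 5 tokens, 8 channels
predicted = model(past, 0.5)  # handled by the middle-stage predictor
```

Keeping one small predictor per stage means each one only has to fit the feature dynamics of its own noise-level range, which is the motivation the caption gives for stage-aware denoising.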