DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang

TL;DR

Video diffusion models deliver high-quality results but are prohibitively expensive to run at scale. This work introduces DisCa, a distillation-aware acceleration framework that couples a lightweight, learnable feature predictor with Restricted MeanFlow to stabilize distillation and preserve details. A cache-based training regime paired with generative adversarial training enables the predictor to recover high-frequency structure, achieving up to $11.8\times$ speedups with minimal additional VRAM. Evaluations on a state-of-the-art video DiT show near-lossless quality under high acceleration, making practical high-resolution video generation more feasible.

Abstract

While diffusion models have achieved great success in video generation, this progress is accompanied by a rapidly escalating computational burden. Among existing acceleration methods, Feature Caching is popular because it is training-free and delivers considerable speedups, but it inevitably loses semantics and detail under further compression. Another widely adopted method, training-aware step-distillation, is successful in image generation but degrades drastically in video generation when only a few steps are used. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, because the sampling steps are sparser. This paper introduces the first distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of the traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. Together, these contributions push the acceleration boundary to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
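The caching idea in the abstract can be illustrated with a small sketch: the expensive model runs only on some sampling steps and refreshes a cache, while a cheap predictor fills in the steps between. This is a toy under stated assumptions, not the authors' implementation. The function names, the `refresh_interval` parameter, the Euler update, and both toy models are hypothetical; in particular, `toy_predictor` merely reuses the cache, which is the behavior a trained predictor is meant to improve on.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def cached_sampling(
    x: Vector,
    timesteps: List[float],
    full_model: Callable[[Vector, float], Tuple[Vector, Vector]],
    predictor: Callable[[Vector, Vector, float], Vector],
    refresh_interval: int,
) -> Tuple[Vector, int]:
    """Euler sampling in which the expensive model runs once every
    `refresh_interval` steps and a cheap predictor handles the rest."""
    if refresh_interval < 1:
        raise ValueError("refresh_interval must be >= 1")
    cache: Vector = []
    full_calls = 0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        if i % refresh_interval == 0:
            # Full pass: compute the velocity and refresh the cache.
            velocity, cache = full_model(x, t)
            full_calls += 1
        else:
            # Cheap pass: estimate the velocity from the cached features.
            velocity = predictor(x, cache, t)
        dt = t_next - t
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x, full_calls


def toy_full_model(x: Vector, t: float) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the large model: velocity equals x."""
    velocity = list(x)
    return velocity, list(velocity)  # the cache is a copy of the output


def toy_predictor(x: Vector, cache: Vector, t: float) -> Vector:
    """Hypothetical stand-in for the learned predictor: plain cache reuse."""
    return list(cache)


# Eight sampling steps from t = 1 to t = 0.
steps = [1 - i / 8 for i in range(9)]
sample, calls = cached_sampling([1.0, -2.0], steps, toy_full_model, toy_predictor, 4)
# With a refresh every 4 steps, the full model runs at steps 0 and 4 only.
```

Setting `refresh_interval=1` recovers ordinary sampling with a full pass at every step, which gives a simple baseline for checking how much error the predictor introduces.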

Paper Structure (37 sections, 8 equations, 7 figures, 4 tables, 3 algorithms)


Figures (7)

  • Figure 1: Feature Caching on the diffusion sampling process w/ and w/o step-distillation. (a) Adjacent timesteps are similar under the undistilled models, allowing traditional caching with simple reuse/interpolation. (b) Significant inter-step differences cause traditional caching to fail; the proposed learnable predictor captures the high-dimensional feature evolution successfully.
  • Figure 2: An overview of Distillation-Compatible Learnable Feature Caching (DisCa). (a) The inference procedure under the proposed Learnable Feature Caching framework. The lightweight Predictor $\mathcal{P}$ performs multi-step fast inference after a single computation pass through the large-scale DiT $\mathcal{M}$. (b) The training process of the Predictor. The cache, initialized by the DiT, is fed into the Predictor as part of the input. The outputs of the Predictor and the DiT are passed to the discriminator $\mathcal{D}$, and training alternates between maximizing and minimizing $\mathcal{L}_{\mathcal{D}}$ as part of the adversarial game. (c) The lightweight Predictor, built from two DiT blocks, typically constitutes less than 4% of the total size of the DiT, enabling high-speed and accurate inference. (d) The Restricted MeanFlow is constructed primarily by pruning the components with a high compression ratio in the original MeanFlow, thereby facilitating the learning of the local mean velocity.
  • Figure 3: Qualitative comparison of MeanFlow and the proposed Restricted MeanFlow. In the video generation scenarios for both 20 steps and 10 steps, the MeanFlow method exhibits noticeable distortion and artifacts. In contrast, the Restricted MeanFlow maintains high quality, as indicated by the metrics in Table \ref{table:Restriced Meanflow}.
  • Figure 4: Visualization of acceleration methods on HunyuanVideo. In the high-acceleration scenarios discussed, previous methods exhibit severe degradation, such as malformation and blurring, while DisCa maintains high quality at an $11.8\times$ acceleration.
  • Figure 5: Loss curves during the generative-adversarial training process. The discriminator $\mathcal{D}$ and predictor $\mathcal{P}$ exhibit a stable adversarial dynamic, which enhances the predictor's generation capability.
  • ...and 2 more figures
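
As background for caption (d) of Figure 2, the "mean velocity" in MeanFlow can be written out explicitly. The first two lines below follow the standard MeanFlow formulation, in which the network learns the average velocity over an interval $[r, t]$ instead of the instantaneous velocity $v$. The last line is only an assumed reading of "pruning the components with a high compression ratio": the bound $\delta$ is a hypothetical symbol that does not appear in the text here, and the paper's actual restriction may differ.

```latex
% Mean velocity over the interval [r, t] (standard MeanFlow definition)
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% MeanFlow identity, obtained by differentiating (t - r)\,u with respect to t
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% Assumed restriction (hypothetical \delta): train only on short intervals
0 \le t - r \le \delta
```

Under this reading, a small $\delta$ keeps $u$ close to the instantaneous velocity $v$, which would match the caption's description of learning a local mean velocity.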