Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration

Yuanxin Wei, Lansong Diao, Bujiao Chen, Shenggan Cheng, Zhengping Qian, Wenyuan Yu, Nong Xiao, Wei Lin, Jiangsu Du

TL;DR

This work proposes MixCache, a training-free caching-based framework for efficient video DiT inference that first distinguishes the interference and boundary between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled.

Abstract

Efficient video generation models are increasingly vital for multimedia synthetic content generation. Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization method in DiT models, leverages the redundancy in the diffusion process to skip computations at different granularities (e.g., step, CFG, block). Nevertheless, existing caching methods are limited to single-granularity strategies and struggle to balance generation quality and inference speed in a flexible manner. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first distinguishes the interference and boundary between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, along with an adaptive hybrid cache decision strategy for dynamically selecting the optimal caching granularity. Extensive experiments on diverse models demonstrate that MixCache can significantly accelerate video generation (e.g., 1.94$\times$ speedup on Wan 14B, 1.97$\times$ speedup on HunyuanVideo) while delivering both superior generation quality and inference efficiency compared to baseline methods.
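To make the idea of skipping computation across denoising steps concrete, the following is a minimal sketch of generic step-level caching only. It is not MixCache's triggering or hybrid decision strategy, which the abstract does not specify. All names (`run_with_step_cache`, `rel_l1`, `threshold`, `warmup`) are hypothetical, the reuse rule (relative L1 change of the input since the last computed step) is an assumption, and the update `x - 0.1 * out` is a placeholder rather than a real diffusion scheduler.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def rel_l1(a: Vector, b: Vector) -> float:
    """Relative L1 distance of a from b (hypothetical redundancy measure)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1e-12  # guard against division by zero
    return num / den


def run_with_step_cache(
    model: Callable[[Vector, int], Vector],
    x: Vector,
    num_steps: int,
    threshold: float,
    warmup: int = 2,
) -> Tuple[Vector, int]:
    """Toy denoising loop that reuses the last model output when the input
    has changed little since the last step that was actually computed.

    Returns the final latent and the number of real model evaluations.
    """
    cached_out: Optional[Vector] = None
    last_in: Optional[Vector] = None
    computed = 0
    for step in range(num_steps):
        # Assumed rule: never reuse during warmup; afterwards reuse only
        # while the input stays within `threshold` of the last computed input.
        reuse = (
            step >= warmup
            and cached_out is not None
            and last_in is not None
            and rel_l1(x, last_in) < threshold
        )
        if reuse:
            out = cached_out
        else:
            out = model(x, step)
            cached_out = out
            last_in = list(x)
            computed += 1
        # Placeholder update standing in for a scheduler step.
        x = [xi - 0.1 * oi for xi, oi in zip(x, out)]
    return x, computed


if __name__ == "__main__":
    toy_model = lambda x, t: [0.5 * xi for xi in x]  # stand-in for a DiT
    _, evals = run_with_step_cache(toy_model, [1.0, 2.0], 10, threshold=0.12)
    print("model evaluations:", evals)
```

In this sketch a single threshold trades quality for speed: a threshold of zero disables reuse entirely, while a very large one reuses the cached output on every step after warmup. CFG-level or block-level caching would apply the same reuse test to a different unit of computation.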

Paper Structure

This paper contains 18 sections, 7 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: MixCache visualization across video DiT models.
  • Figure 2: Three levels of redundancy across denoising timesteps in Wan 14B 480p and HunyuanVideo 540p.
  • Figure 3: Similarity metric compared with the original model using different cache strategies in different timesteps.
  • Figure 4: The MixCache framework.
  • Figure 5: (a) Left: Distance between the perturbed and original output, which serves as the impact indicator. (b) Right: Visualization and quantitative metrics (LPIPS ↓, PSNR ↑, SSIM ↑) of interference at different levels and at different diffusion stages.
  • ...and 7 more figures