QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, Xiaokang Yang

TL;DR

QuantCache addresses the heavy compute and memory demands of Diffusion Transformers for video generation by jointly optimizing hierarchical latent caching (HLC), adaptive importance-guided quantization (AIGQ), and structural redundancy-aware pruning (SRAP) in a training-free framework. It introduces a dynamic, content-aware compute allocation scheme with a feature-divergence-based caching policy and a per-timestep, per-layer mixed-precision quantization budget, complemented by online pruning of redundant layers within a timestep. The method delivers a reported end-to-end speedup of $6.72\times$ on Open-Sora with minimal quality loss and outperforms prior cache and quantization baselines on multiple benchmarks, supported by optimized CUDA kernels. This work provides a scalable, hardware-friendly approach to efficient DiT inference, enabling practical video generation on edge and resource-limited platforms.
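To make the idea of a feature-divergence-based caching policy concrete, here is a minimal sketch of one generic form such a gate could take. It is not the authors' HLC implementation: the divergence measure (relative L1 distance), the fixed threshold, and the names `relative_l1` and `DivergenceCache` are all assumptions chosen for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_l1(current: Vector, reference: Vector) -> float:
    """Relative L1 distance between two feature vectors (assumed metric)."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) + 1e-8  # avoid division by zero
    return num / den


class DivergenceCache:
    """Hypothetical wrapper: reuse a block's cached output while its input
    stays close to the input seen at the last full computation."""

    def __init__(self, block: Callable[[Vector], Vector], threshold: float):
        self.block = block
        self.threshold = threshold  # assumed fixed; a real policy may adapt it
        self.cached_input: Optional[Vector] = None
        self.cached_output: Optional[Vector] = None
        self.recomputed = 0  # number of full block evaluations

    def __call__(self, x: Vector) -> Vector:
        if (
            self.cached_input is not None
            and relative_l1(x, self.cached_input) < self.threshold
        ):
            # Input barely changed since the last computation: reuse the cache.
            return self.cached_output
        # Input diverged (or first call): recompute and refresh the cache.
        self.cached_input = list(x)
        self.cached_output = self.block(x)
        self.recomputed += 1
        return self.cached_output


if __name__ == "__main__":
    # Stand-in for an expensive Transformer block.
    cache = DivergenceCache(lambda v: [2.0 * t for t in v], threshold=0.05)
    cache([1.0, 2.0])   # first call: computed
    cache([1.01, 2.0])  # small change: served from cache
    cache([2.0, 2.0])   # large change: recomputed
    print(cache.recomputed)
```

In this sketch the threshold trades speed against fidelity: a larger value skips more computation but tolerates more drift between the cached and the true output.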

Abstract

Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and caching mechanisms, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of $6.72\times$ on Open-Sora with minimal loss in generation quality. Extensive experiments across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. The code and models will be available at https://github.com/JunyiWuCode/QuantCache.

Paper Structure

This paper contains 15 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overview of $\texttt{QuantCache}$ with (a) HLC, (b) AIGQ, (c) SRAP. STA, CA, and FFN respectively refer to spatial-temporal attention, cross attention, and feedforward network in a Transformer layer.
  • Figure 2: AIGQ: adaptive importance-guided quantization.
  • Figure 3: Spatial and temporal differences across adjacent layers for spatial-temporal attention, cross-attention, and feed-forward network.