6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu

Abstract

Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking how the quantization difficulty of activations varies across diffusion timesteps, which leads to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. In addition, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92$\times$ end-to-end acceleration and 3.32$\times$ memory reduction, setting a new baseline for efficient inference in Video DiTs.
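The predictor described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the input-output difference is a relative L1 distance normalized by the input magnitude, that the per-layer predictor is a one-dimensional least-squares line, and that a single error threshold separates NVFP4 from INT8. The function names, the normalization, the threshold, and the synthetic calibration data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def relative_l1(block_in, block_out):
    """Relative L1 distance between a block's input and output.

    Assumption: normalized by the L1 norm of the input.
    """
    return float(np.abs(block_out - block_in).sum() / (np.abs(block_in).sum() + 1e-12))

def fit_predictor(gammas_prev, errors_now):
    """Least-squares fit of error ~= alpha * gamma_prev + beta for one linear layer."""
    # np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
    alpha, beta = np.polyfit(gammas_prev, errors_now, deg=1)
    return float(alpha), float(beta)

def choose_precision(gamma_prev, alpha, beta, threshold):
    """Pick NVFP4 when the predicted quantization error is small, INT8 otherwise."""
    predicted_error = alpha * gamma_prev + beta
    return "NVFP4" if predicted_error <= threshold else "INT8"

# Synthetic calibration pairs (illustrative only, not measured data).
rng = np.random.default_rng(0)
gammas = rng.uniform(0.05, 0.5, size=64)
errors = 0.8 * gammas + 0.02 + rng.normal(0.0, 0.005, size=64)
alpha, beta = fit_predictor(gammas, errors)

# At inference time, the distance observed at the previous timestep
# selects the activation format for the current timestep.
fmt_stable = choose_precision(0.05, alpha, beta, threshold=0.1)
fmt_volatile = choose_precision(0.5, alpha, beta, threshold=0.1)
```

The decision costs one multiply-add per layer, which is what makes a per-timestep choice cheap under these assumptions. The threshold is the knob that trades memory compression against robustness.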

Paper Structure (16 sections, 12 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Relative L2 quantization error (defined in Eq. \ref{eq:r_l2}) of individual linear layers in CogVideoX \cite{yang2024cogvideox} across denoising timesteps. The severe temporal fluctuations demonstrate that activation sensitivity to quantization is highly dynamic, highlighting the limitations of static mixed-precision policies.
  • Figure 2: Overview of our proposed methods. We use DMPQ to decide the quantization bit-width of each linear layer's activations based on the block's input-output relative L1 distance $\Gamma$ at the preceding timestep, and TDC to decide whether to skip a block's computation based on the block updates $\Delta$ from the two preceding timesteps.
  • Figure 3: The linear relationship between the block-level input-output relative L1 distance at the previous timestep ($\Gamma_{t-1}$) and the layer-wise relative quantization error ($E_{rel}$) at the current timestep. The data points are collected by executing the diffusion model over a calibration set. For each block, we extract the time-shifted valid data pairs across all timesteps to perform linear regression ($y = \alpha x + \beta$) for each internal linear layer.
  • Figure 4: Temporal redundancy of Transformer block updates in Video DiTs. We visualize the Cosine Similarity (left) and Relative L2 Difference (right) of the residual updates ($\Delta_t^l$ and $\Delta_{t-1}^l$) between adjacent timesteps. The inherently high similarity across most timesteps and layers motivates our adaptive Temporal Delta Cache.
  • Figure 5: Visual comparisons between our method, the FP16 baseline (CogVideoX \cite{yang2024cogvideox}), and the quantization methods QuaRot \cite{ashkboos2024quarot}, SmoothQuant \cite{xiao2023smoothquant}, and ViDiT-Q \cite{zhao2024vidit}.
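
The skip decision outlined in the captions of Figures 2 and 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the block update is $\Delta = \text{output} - \text{input}$, that a block is skipped when the two most recent updates pass both a cosine-similarity and a relative-L2 test, and that a skipped block reuses the latest cached update. The class name `DeltaCache` and both thresholds are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def relative_l2(a, b):
    """L2 norm of (a - b), relative to the L2 norm of b."""
    return float(np.linalg.norm(a - b) / (np.linalg.norm(b) + 1e-12))

class DeltaCache:
    """Keeps the last two residual updates (output - input) of one block."""

    def __init__(self, cos_min=0.95, rel_l2_max=0.1):
        self.cos_min = cos_min        # hypothetical similarity threshold
        self.rel_l2_max = rel_l2_max  # hypothetical difference threshold
        self.deltas = []              # oldest first, at most two entries

    def should_skip(self):
        """Skip only if the two preceding updates are nearly identical."""
        if len(self.deltas) < 2:
            return False
        older, newer = self.deltas
        return (cosine_similarity(newer, older) >= self.cos_min
                and relative_l2(newer, older) <= self.rel_l2_max)

    def forward(self, block, x):
        """Return (output, skipped). A skipped block reuses the cached update."""
        if self.should_skip():
            return x + self.deltas[-1], True
        out = block(x)
        self.deltas = (self.deltas + [out - x])[-2:]
        return out, False
```

In this sketch the cache is not refreshed after a skip, so a block that qualifies once keeps being skipped. A practical scheme would need some policy for forcing recomputation, which is not modelled here.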