TR-DQ: Time-Rotation Diffusion Quantization
Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, Hao Tang
TL;DR
TR-DQ tackles diffusion-model quantization by introducing time-step aware rotation-based quantization that dynamically adapts rotations, diagonals, and permutations per time step to smooth activations and shift challenging dynamics into weights, formalized with time-dependent matrices $\mathbf{R}_t$, $\boldsymbol{\Delta}_t$, and $\mathbf{P}_t$. It further leverages Attention-Sharing to exploit high similarity between CFG and non-CFG attention blocks, reducing computation without large quality loss. The approach achieves state-of-the-art performance on image and video generation after quantization, delivering a practical speedup of $1.38$–$1.89$× and memory reduction of $1.97$–$2.58$× compared to existing quantization methods. By enabling finer-grained, time-aware quantization and targeted attention sharing, TR-DQ facilitates efficient deployment of diffusion models on resource-constrained hardware while maintaining high visual fidelity and temporal coherence.
Abstract
Diffusion models have been widely adopted in image and video generation. However, their complex network architectures lead to high inference overhead during generation. Existing diffusion quantization methods primarily focus on quantizing the model structure while ignoring the impact of time-step variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across time-steps. Additionally, we explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image and video generation tasks, with a 1.38-1.89x speedup and 1.97-2.58x memory reduction in inference compared to existing quantization methods.
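The idea of rotating activations per time-step group can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an orthogonal rotation $\mathbf{R}_t$ built from a normalized Hadamard matrix with random sign flips seeded by a time-step bucket, and a simple symmetric per-tensor quantizer. All function names (`rotation_for_bucket`, `bucket_of`, `fake_quant`, `demo`) and the bucketing rule are hypothetical. The sketch relies only on the identity $\mathbf{X}\mathbf{W} = (\mathbf{X}\mathbf{R}_t)(\mathbf{R}_t^\top\mathbf{W})$, which holds for any orthogonal $\mathbf{R}_t$, so the rotation can be folded into the weights while spreading an outlier channel across all channels.

```python
import math
import random

def hadamard(n):
    """Normalized Sylvester-Hadamard matrix (orthogonal); n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in r] for r in H]

def bucket_of(t, num_steps, num_buckets):
    """Assumed rule: map step t in [0, num_steps) to one of num_buckets equal groups."""
    return t * num_buckets // num_steps

def rotation_for_bucket(n, bucket):
    """Hypothetical per-bucket rotation: sign-flip diagonal times Hadamard (still orthogonal)."""
    rng = random.Random(bucket)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return [[s * v for v in row] for s, row in zip(signs, hadamard(n))]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def max_abs(A):
    return max(abs(v) for row in A for v in row)

def fake_quant(A, bits=4):
    """Symmetric per-tensor quantize-dequantize (a generic quantizer, chosen for illustration)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max_abs(A)
    scale = amax / qmax if amax > 0 else 1.0
    return [[max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in row] for row in A]

def demo(t=10, num_steps=50, num_buckets=4, n=8, seed=0):
    rng = random.Random(seed)
    # Synthetic activations: small values plus one large outlier channel (index 3).
    X = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(4)]
    for row in X:
        row[3] += 50.0
    W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n)]
    R = rotation_for_bucket(n, bucket_of(t, num_steps, num_buckets))
    X_rot = matmul(X, R)             # rotated activations
    W_rot = matmul(transpose(R), W)  # inverse rotation absorbed into the weights
    return X, W, X_rot, W_rot

if __name__ == "__main__":
    X, W, X_rot, W_rot = demo()
    print("max |X| before rotation:", max_abs(X))
    print("max |X| after rotation: ", max_abs(X_rot))
```

In this toy setting the rotated activations have a smaller dynamic range than the originals, so a fixed-bit quantizer such as `fake_quant` can use a finer step size. Running `fake_quant` on `X` and `X_rot` and comparing the resulting matrix products is one way to inspect the effect; the actual gains reported for TR-DQ depend on the paper's full method, which this sketch does not reproduce.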
