Post-Training Quantization for Audio Diffusion Transformers

Tanmay Khandelwal, Magdalena Fuentes

TL;DR

This work addresses the practical deployment challenges of audio Diffusion Transformers (DiTs) by evaluating post-training quantization (PTQ) approaches. It introduces two extensions, denoising-timestep-aware smoothing (SQD) and a low-rank adapter (LoRA) branch derived from SVD, to mitigate activation outliers and residual weight errors, and it includes a static variant (SQS) for low-latency scenarios. Experiments on Stable Audio Open with AudioCaps show that dynamic SQD largely preserves FP32 quality in both 8-bit and 4-bit settings, while static quantization remains competitive at 8-bit but degrades at 4-bit; the LoRA branch helps at 8-bit but is less effective at 4-bit. Quantized DiTs reduce memory usage by up to about 79% while keeping perceptual quality acceptable, paving the way for more efficient, real-time audio generation on consumer hardware.

Abstract

Diffusion Transformers (DiTs) enable high-quality audio synthesis but are often computationally intensive and require substantial storage, which limits their practical deployment. In this paper, we present a comprehensive evaluation of post-training quantization (PTQ) techniques for audio DiTs, analyzing the trade-offs between static and dynamic quantization schemes. We explore two practical extensions: (1) a denoising-timestep-aware smoothing method that adapts quantization scales per input channel and per timestep to mitigate activation outliers, and (2) a lightweight low-rank adapter (LoRA) branch derived from singular value decomposition (SVD) to compensate for residual weight errors. Using Stable Audio Open, we benchmark W8A8 and W4A8 configurations across objective metrics and human perceptual ratings. Our results show that dynamic quantization preserves fidelity even at lower precision, while static methods remain competitive and offer lower latency. Overall, our findings show that low-precision DiTs can retain high-fidelity generation while reducing memory usage by up to 79%.
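The low-rank branch described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the branch is the best rank-r approximation of the weight residual W - Q(W), obtained by truncated SVD, and that the branch stays in full precision. The function names, the symmetric per-row quantizer, the matrix sizes, and the rank of 8 are all hypothetical choices made for illustration.

```python
import numpy as np

def quantize_weights(w, n_bits):
    """Symmetric per-output-row fake quantization (quantize, then dequantize)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / q_max
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

def low_rank_branch(w, w_q, rank):
    """Rank-`rank` factors (a, b) of the residual w - w_q via truncated SVD."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out, rank)
    b = vt[:rank, :]            # shape (rank, in)
    return a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))  # hypothetical linear layer, shape (out, in)
x = rng.normal(size=(16, 128))  # hypothetical input tokens

errors = {}
for n_bits in (8, 4):
    w_q = quantize_weights(w, n_bits)
    a, b = low_rank_branch(w, w_q, rank=8)
    errors[n_bits] = {
        "plain": np.linalg.norm(w - w_q),
        "with_branch": np.linalg.norm(w - (w_q + a @ b)),
    }
    # Forward pass: quantized main path plus the small full-precision side branch.
    y = x @ w_q.T + (x @ b.T) @ a.T
    print(n_bits, errors[n_bits])
```

In this sketch the side branch adds rank * (in + out) parameters per layer, so the rank sets the trade-off between memory overhead and how much of the residual is recovered.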

Paper Structure

This paper contains 6 sections, 9 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Activation map at denoising timestep 50 for DiT Block 24, showing activation values across tokens and input channels.
  • Figure 2: Visualization of input activation range across denoising timesteps (100 $\rightarrow$ 0) for Block 1. The shaded region represents the full activation span (min to max), while the solid line denotes the median activation. As denoising progresses, the range of activations increases significantly, highlighting the emergence of outliers in later steps.
  • Figure 3: Visualizing “easy vs. hard to quantize” regions and outliers. The spikes in activations (“outliers”) lead to low effective bits for other channels, whereas flatter distributions (“smoothed”) are more amenable to quantization.
  • Figure 4: Mean user ratings (1–5 scale) from the subjective evaluation in W8A8 for the full-precision baseline (baseline), the fastest variant (Model SQS), and the best-performing model (Model SQD+LoRA).
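
The effect described in the captions of Figures 2 and 3 can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's method: it uses a SmoothQuant-style migration rule, s_j = max|x_j|^alpha / max|w_j|^(1-alpha), recomputed from the activations at each denoising timestep, with per-tensor 8-bit activation quantization and per-output-channel 8-bit weight quantization. The outlier channel index, the per-timestep outlier gains, alpha = 0.5, and all function names are hypothetical.

```python
import numpy as np

def fake_quant(x, n_bits, axis=None):
    """Symmetric quantize-dequantize; axis=None uses one scale for the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / q_max
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -q_max, q_max) * scale

def smoothing_scales(x, w, alpha=0.5):
    """Per-input-channel scales s_j = max|x_j|^alpha / max|w_j|^(1 - alpha)."""
    act_max = np.maximum(np.abs(x).max(axis=0), 1e-8)  # over tokens, shape (in,)
    w_max = np.maximum(np.abs(w).max(axis=1), 1e-8)    # over outputs, shape (in,)
    return act_max ** alpha / w_max ** (1 - alpha)

def w8a8_error(x, w, s=None):
    """Relative output error of an 8-bit weight, 8-bit activation matmul."""
    if s is not None:
        # (x / s) @ (s * w) equals x @ w, but the outlier moves into the weights.
        x, w = x / s, w * s[:, None]
    y_ref = x @ w
    y_q = fake_quant(x, 8) @ fake_quant(w, 8, axis=0)
    return np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))            # hypothetical weights, shape (in, out)
OUTLIER = 3                              # hypothetical outlier channel index
gains = {100: 10.0, 50: 40.0, 0: 160.0}  # assumed outlier growth as denoising proceeds

results = {}
for t, gain in gains.items():
    x = rng.normal(size=(256, 64))       # synthetic activations, shape (tokens, in)
    x[:, OUTLIER] *= gain
    s = smoothing_scales(x, w)           # recomputed at every timestep
    results[t] = {
        "plain": w8a8_error(x, w),
        "smoothed": w8a8_error(x, w, s),
        "outlier_scale": s[OUTLIER],
    }
    print(t, results[t])
```

Without smoothing, the single outlier channel sets the activation scale, so the remaining channels are rounded onto only a few quantization levels. Dividing each channel by its scale flattens the activation range, and because the outlier grows across timesteps in this synthetic setup, the scale for that channel grows with it.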