PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models

Jayneel Vora, Aditya Krishnan, Nader Bouacida, Prabhu RV Shankar, Prasant Mohapatra

TL;DR

PTQ4ADM is a novel framework for quantizing audio diffusion models. It reduces model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models, and it shows that specific layers in the backbone network can be quantized to 4-bit weights and 8-bit activations without significant quality loss.

Abstract

Denoising diffusion models have emerged as state-of-the-art in generative tasks across image, audio, and video domains, producing high-quality, diverse, and contextually relevant data. However, their broader adoption is limited by high computational costs and large memory footprints. Post-training quantization (PTQ) offers a promising approach to mitigate these challenges by reducing model complexity through low-bandwidth parameters. Yet, direct application of PTQ to diffusion models can degrade synthesis quality due to accumulated quantization noise across multiple denoising steps, particularly in conditional tasks like text-to-audio synthesis. This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs). Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs. These techniques ensure comprehensive coverage of audio aspects and modalities while preserving synthesis fidelity. We validate our approach on TANGO, Make-An-Audio, and AudioLDM models for text-conditional audio generation. Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models (<5% increase in FD scores). We show that specific layers in the backbone network can be quantized to 4-bit weights and 8-bit activations without significant quality loss. This work paves the way for more efficient deployment of ADMs in resource-constrained environments.
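To make the relationship between per-layer bit widths and overall model size concrete, the following is a minimal arithmetic sketch, not the authors' accounting. The layer names, the parameter counts, and the 32-bit full-precision baseline are all hypothetical values chosen for illustration; the abstract does not state them.

```python
def quantized_size_fraction(layer_params, layer_bits, baseline_bits=32):
    """Return (quantized size) / (baseline size) for a per-layer bit assignment.

    layer_params: dict mapping layer name -> number of weights (hypothetical).
    layer_bits:   dict mapping layer name -> weight bit width after quantization;
                  layers not listed stay at baseline_bits.
    baseline_bits: assumed full-precision width (32 is an assumption here).
    """
    baseline = sum(layer_params.values()) * baseline_bits
    quantized = sum(
        count * layer_bits.get(name, baseline_bits)
        for name, count in layer_params.items()
    )
    return quantized / baseline


# Hypothetical model: 80% of the weights sit in layers quantized to 4 bits,
# the remaining 20% stay at the assumed 32-bit baseline.
layer_params = {"backbone_conv": 80_000_000, "other": 20_000_000}
layer_bits = {"backbone_conv": 4}

fraction = quantized_size_fraction(layer_params, layer_bits)
reduction = 1.0 - fraction
print(f"size reduction: {reduction:.0%}")
```

Under these assumed numbers the weights shrink to 30% of their original size. Activation bit widths affect runtime memory rather than stored model size, so they do not appear in this calculation.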

Paper Structure

This paper contains 11 sections, 3 equations, 4 figures, 1 table, 2 algorithms.

Figures (4)

  • Figure 1: A diagram of how the PTQ4ADM framework generates intermediates for the calibration set.
  • Figure 2: Activation distributions within the TimestepEmbedSequential block, located in the input layers of the Make-An-Audio diffusion model, for two distinct prompts: (1) a single, continuous alarm beep, and (2) a surreal, dissonant melody characterized by mechanical grinding, sporadic bursts of static, and alien-like vocalizations.
  • Figure 3: FAD score at varying bitwidths for the Conv2d layer of the U-Net, using uniform quantization without calibration, across the considered ADMs.
  • Figure 4: FD score of an 8W16A quantized Make-An-Audio model with various enhanced prompts based on 100 sampled captions from the AudioCaps dataset. The analysis compares intermediate sampling strategies across timesteps (Random, Normal, and Ours) and enhanced prompts versus vanilla prompts from the dataset.
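
Figure 3 refers to uniform quantization at varying bitwidths without calibration. As a rough illustration of that idea, the sketch below applies a generic min/max affine quantize-dequantize step to a list of synthetic values and measures the reconstruction error at several bit widths. It is an assumption-laden sketch, not the paper's procedure: the function names are hypothetical, the "weights" are random Gaussian samples, and the error shown is mean squared error on the values themselves rather than the FAD score reported in the figure.

```python
import random


def uniform_quantize(values, num_bits):
    """Quantize then dequantize values on a uniform grid of 2**num_bits levels.

    The grid spans the min/max of the input itself (no calibration data),
    which is an assumption made for this sketch.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    if hi == lo:
        return list(values)  # constant input: nothing to quantize
    scale = (hi - lo) / levels
    dequantized = []
    for v in values:
        code = round((v - lo) / scale)       # integer code
        code = max(0, min(levels, code))     # clamp to [0, levels]
        dequantized.append(lo + code * scale)
    return dequantized


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a layer's weights (hypothetical data).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

errors = {
    bits: mean_squared_error(weights, uniform_quantize(weights, bits))
    for bits in (2, 4, 8)
}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.6f}")
```

The error shrinks as the bit width grows because the grid spacing halves with each added bit. How that per-tensor error translates into audio quality metrics depends on the model and is what the figure measures.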