ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

TL;DR

The paper tackles the slow inference of diffusion-based text-to-audio (TTA) generation by introducing ConsistencyTTA, a CFG-aware latent-space consistency model that requires only a single network query per generation. It integrates classifier-free guidance (CFG) into consistency distillation and enables closed-loop finetuning with audio-text metrics such as the CLAP score, achieving approximately 400x computational savings on AudioCaps with quality and diversity comparable to state-of-the-art diffusion methods. The approach combines latent-space generation with a CFG-aware distillation framework and supports on-device deployment, broadening the practical applications of TTA. Overall, ConsistencyTTA offers a fast, high-quality option for real-time TTA and similar conditional generation tasks.

Abstract

Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that, compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
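
To make the query-count argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a diffusion baseline that runs 200 sampling steps and applies CFG externally, so each step costs one conditional and one unconditional network query; the step count, the function names, and the particular CFG convention are illustrative assumptions that this page does not state.

```python
# Illustrative sketch only: names and the 200-step baseline are assumptions.

def cfg_combine(cond, uncond, w):
    """Classifier-free guidance in one common convention:
    start from the unconditional prediction and move toward the
    conditional one, scaled by guidance weight w (w = 1 returns cond)."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def queries_per_generation(num_steps, external_cfg):
    """Network queries needed for one generation.
    External CFG needs a conditional and an unconditional query per step."""
    return num_steps * (2 if external_cfg else 1)


# Assumed diffusion baseline: 200 steps, CFG applied at inference time.
diffusion_queries = queries_per_generation(200, external_cfg=True)

# Consistency model with CFG folded into training: one query, no extra pass.
consistency_queries = queries_per_generation(1, external_cfg=False)

reduction = diffusion_queries // consistency_queries
print(f"{diffusion_queries} vs {consistency_queries} queries ({reduction}x fewer)")
```

Under these assumptions, the reduction in network queries matches the 400x figure quoted in the abstract. Folding CFG into training is what removes the second (unconditional) query at inference time.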

Paper Structure

This paper contains 29 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: ConsistencyTTA achieves a 400x computation reduction compared with a diffusion baseline model while sacrificing much less quality than traditional acceleration methods.
  • Figure 2: ConsistencyTTA checkpoints in the main results table with different CFG weights.
  • Figure 3: Consistency model generated Mel spectrograms from the first 50 AudioCaps prompts with four different seeds. Each row corresponds to a prompt, and each column corresponds to a seed. The generations from a prompt with different seeds are correlated but distinctly different.