AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

TL;DR

AudioLCM is a latent consistency model for text-to-audio generation that reaches high fidelity in as few as 2 sampling iterations. It integrates Consistency Models into the audio latent space and trains them with a one-stage Guided Latent Consistency Distillation that uses a multi-step ODE solver, which shortens the time schedule from thousands of steps to dozens while preserving sample quality. A Transformer backbone that adopts techniques from LLaMA supports stable and efficient training, and classifier-free guidance (CFG) is built into the distillation process. On text-to-sound and text-to-music tasks, 2-step generation runs up to 333x faster than real time on a single NVIDIA 4090Ti GPU while remaining competitive with state-of-the-art models that use hundreds of steps.

Abstract

Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their deployment for text-to-audio generation. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. AudioLCM enables a sampling speed 333x faster than real time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.
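
The abstract's "mapping from any point at any time step to the trajectory's initial point" can be illustrated with a small sketch. The code below is not AudioLCM's implementation: it assumes the variance-exploding parameterization and multistep sampling loop from the original consistency models work, and it replaces the trained network with a hypothetical analytic `consistency_fn` that is exact for toy one-dimensional Gaussian "data". The constants `EPS`, `T_MAX`, `MU`, and `S` are illustrative assumptions.

```python
import numpy as np

EPS = 0.002        # smallest time on the trajectory (assumed value)
T_MAX = 80.0       # largest time / noise level (assumed value)
MU, S = 1.5, 0.5   # toy "data" distribution N(MU, S^2), a stand-in for audio latents


def consistency_fn(x, t):
    """Map a point x at time t to its trajectory's starting point (time EPS).

    For Gaussian data under dx/dt = t * (x - MU) / (S^2 + t^2), trajectories are
    rescalings around MU, so this closed form plays the role of a trained model.
    """
    scale = np.sqrt(S**2 + EPS**2) / np.sqrt(S**2 + t**2)
    return MU + (x - MU) * scale


def multistep_sample(shape, timesteps, rng):
    """Few-step sampling: denoise once, then alternately re-noise and denoise."""
    x = rng.standard_normal(shape) * timesteps[0]   # start from pure noise
    sample = consistency_fn(x, timesteps[0])        # step 1: one-shot estimate
    for t in timesteps[1:]:
        noise = rng.standard_normal(shape)
        x_t = sample + np.sqrt(t**2 - EPS**2) * noise  # re-noise to level t
        sample = consistency_fn(x_t, t)                # refine the estimate
    return sample


rng = np.random.default_rng(0)
two_step = multistep_sample((4, 8), [T_MAX, 1.0], rng)  # 2 iterations in total
```

Each extra entry in `timesteps` costs one more model call, which is why reducing the count to 2 dominates the speed-up. In a real system `consistency_fn` would be a text-conditioned network operating on audio latents.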

Paper Structure

This paper contains 39 sections, 20 equations, 4 figures, 7 tables, 2 algorithms.

Figures (4)

  • Figure 1: An illustration of AudioLCM. AudioLCM proposes the Guided Consistency Distillation with a $k$-step ODE solver. ${\bm{c}}$ is the text embedding and $\omega$ is the classifier-free guidance scale.
  • Figure 2: Subfigure (a) shows the correlation between audio quality and the estimate interval $k$ of the ODE solver across the test set. Subfigure (b) shows how different classifier-free guidance scales affect FAD.
  • Figure 3: We evaluate the relationship between the inference latency and sample quality measured by FAD.
  • Figure 4: Screenshots of subjective evaluations.
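
The symbols in the captions of Figures 1 and 2 ($k$, ${\bm{c}}$, $\omega$) can be related through the formulation commonly used for guided latent consistency distillation. The block below is a sketch in that general notation, not a reproduction of this paper's equations, which are not included in this excerpt. The names $\Psi$ (one ODE-solver update of the teacher model), $f_\theta$ (the student), $f_{\theta^-}$ (its target copy), $d$ (a distance function), and $\varnothing$ (the empty text condition) are assumptions introduced for illustration.

```latex
% Guided k-step solver estimate: jump from t_{n+k} back to t_n, mixing the
% conditional and unconditional teacher updates with guidance scale \omega.
\hat{\bm{z}}_{t_n}
  = \bm{z}_{t_{n+k}}
  + (1+\omega)\,\Psi(\bm{z}_{t_{n+k}}, t_{n+k}, t_n, \bm{c})
  - \omega\,\Psi(\bm{z}_{t_{n+k}}, t_{n+k}, t_n, \varnothing)

% Consistency distillation objective: both points lie on (approximately) the
% same trajectory, so the student should map them to the same output.
\mathcal{L}(\theta)
  = \mathbb{E}\!\left[
      d\!\left(
        f_{\theta}(\bm{z}_{t_{n+k}}, \bm{c}, \omega, t_{n+k}),\;
        f_{\theta^-}(\hat{\bm{z}}_{t_n}, \bm{c}, \omega, t_n)
      \right)
    \right]
```

Under this reading, a larger $k$ lets each training pair span more of the original schedule, which is how thousands of teacher steps can shrink to dozens. The trade-off is a coarser solver estimate, consistent with Figure 2(a) examining audio quality as a function of $k$.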