AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

Junqi Zhao, Jinzheng Zhao, Haohe Liu, Yun Chen, Lu Han, Xubo Liu, Mark Plumbley, Wenwu Wang

TL;DR

AudioTurbo is a fast text-to-audio (TTA) generation method that integrates rectified diffusion with a pre-trained TTA model (Auffusion). It learns first-order ODE paths via deterministic noise–data coupling, enabling high-quality audio in as few as 3–10 sampling steps and outperforming flow-based accelerators on AudioCaps. The approach uses a CLIP-based text encoder, a latent diffusion model trained with rectified diffusion on teacher-generated pairs, and classifier-free guidance to balance fidelity and diversity. This combination achieves strong text–audio alignment and perceptual quality with significantly reduced inference time, advancing real-time TTA applications and offering a path toward distillation and broader audio-generation tasks.
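As a rough illustration of the classifier-free guidance step mentioned above, the sketch below combines a text-conditioned and an unconditional noise prediction using the standard guidance formula. The function name `cfg_combine` and the plain-list representation of predictions are assumptions made for this example; they are not taken from the paper or from Auffusion.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance combination (illustrative sketch).

    eps_uncond: noise prediction without the text condition
    eps_cond:   noise prediction with the text condition
    guidance_scale: 1.0 returns the conditional prediction unchanged;
                    larger values push further along the conditional direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical two-element predictions, used only to exercise the formula.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

A larger guidance scale generally trades diversity for closer adherence to the text prompt, which is the fidelity/diversity balance the summary refers to.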

Abstract

Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise-sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model outperforms prior models with only 10 sampling steps and, compared to a flow-matching-based acceleration model, reduces inference to 3 steps.
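One way to read "first-order ODE paths" is that, for a fixed noise-sample pair, every noisy latent on the path is the same weighted combination of that pair, so a model that predicts the paired noise accurately can recover the sample in very few steps. The toy sketch below illustrates this under loud assumptions: the cosine schedule in `alpha_sigma`, the start time of 0.999, the deterministic DDIM-style update, and the `oracle` predictor (a stand-in for a perfectly trained network) are all hypothetical choices for this example, not the paper's implementation.

```python
import math
import random


def alpha_sigma(t):
    """Assumed variance-preserving schedule: alpha = cos(pi*t/2), sigma = sin(pi*t/2)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def noisy_latent(x0, eps, t):
    """Point on the path of a fixed (noise, sample) pair: x_t = alpha_t*x0 + sigma_t*eps."""
    a, s = alpha_sigma(t)
    return [a * x + s * e for x, e in zip(x0, eps)]


def few_step_sample(predict_eps, x_start, num_steps, t_start=0.999):
    """Deterministic DDIM-style sampler on a uniform time grid from t_start down to 0."""
    x = list(x_start)
    for i in range(num_steps):
        t = t_start * (1 - i / num_steps)
        t_next = t_start * (1 - (i + 1) / num_steps)
        a, s = alpha_sigma(t)
        a_next, s_next = alpha_sigma(t_next)
        eps_hat = predict_eps(x, t)
        # Estimate the clean sample implied by the current noise prediction.
        x0_hat = [(xi - s * ei) / a for xi, ei in zip(x, eps_hat)]
        # Move to the next time using the same (x0_hat, eps_hat) pair.
        x = [a_next * x0i + s_next * ei for x0i, ei in zip(x0_hat, eps_hat)]
    return x


random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(4)]   # stand-in for a teacher-generated latent
eps = [random.gauss(0, 1) for _ in range(4)]  # the noise paired with that latent
x_start = noisy_latent(x0, eps, 0.999)


def oracle(x, t):
    """Idealized predictor that always returns the paired noise."""
    return eps


out_3 = few_step_sample(oracle, x_start, num_steps=3)
out_10 = few_step_sample(oracle, x_start, num_steps=10)
```

With the idealized predictor, the 3-step and 10-step results both reproduce `x0` up to floating-point error, which is the intuition behind few-step sampling. A real network only approximates this, so quality still depends on the step count.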

Paper Structure

This paper contains 17 sections, 9 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: An overview of the AudioTurbo architecture. Note that the trainable parameters are initialized from the pre-trained TTA model, Auffusion.