
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

TL;DR

Auffusion transfers the strong generative capacity and cross-modal alignment of pretrained text-to-image diffusion models to text-to-audio tasks, achieving high-quality audio with limited data and resources. By transforming audio into a latent, image-like representation and employing a cross-attention conditioned latent diffusion process, it attains superior text-audio alignment, validated through both objective metrics and human judgments. The work also provides a systematic study of text encoders and visualizes cross-attention maps to diagnose alignment, revealing that pretrained LDMs offer robust transfer of cross-modal understanding to TTA. This approach enables versatile audio manipulations, including style transfer, inpainting, and token-level attention control, with practical implications for scalable and controllable audio generation.
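As a rough illustration of the "image-like representation" idea mentioned above, the following sketch converts a waveform into a log-magnitude spectrogram, quantizes it to 8-bit pixels, and maps the pixels back. It is only a minimal sketch under stated assumptions: the function names, the STFT settings (`n_fft`, `hop`), the min-max normalization, and the use of a plain (non-mel) spectrogram are all illustrative choices, not the authors' pipeline, and the latent encoding step is omitted.

```python
import numpy as np

def log_spectrogram(wave, n_fft=512, hop=128, eps=1e-5):
    """Log-magnitude STFT with shape (n_fft // 2 + 1, n_frames).
    n_fft, hop and eps are illustrative values, not taken from the paper."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log(mag + eps).T

def spec_to_pixels(spec):
    """Min-max normalize a spectrogram to uint8 pixels; also return the bounds."""
    lo, hi = float(spec.min()), float(spec.max())
    scaled = (spec - lo) / max(hi - lo, 1e-8)
    return np.round(scaled * 255).astype(np.uint8), (lo, hi)

def pixels_to_spec(pixels, bounds):
    """Invert spec_to_pixels up to 8-bit quantization error."""
    lo, hi = bounds
    return pixels.astype(np.float32) / 255.0 * (hi - lo) + lo

# Toy input: one second of a 440 Hz sine at 16 kHz (hypothetical test signal).
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t).astype(np.float32)

spec = log_spectrogram(wave)
pixels, bounds = spec_to_pixels(spec)
recovered = pixels_to_spec(pixels, bounds)
```

In this sketch the only information lost in the spectrogram-to-pixel round trip is the 8-bit quantization; turning the recovered spectrogram back into audio would additionally need a phase-reconstruction method or a vocoder, which is not shown here.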

Abstract

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AI-generated content (AIGC). Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous T2I studies recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, whereas similar evaluation is lacking in prior TTA work. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations. Our implementation and demos are available at https://auffusion.github.io.
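To make the notion of a cross-attention map concrete, here is a minimal sketch of generic scaled dot-product cross-attention between latent positions (queries) and text tokens (keys), with the weights reshaped into one 2D map per token. This is not the authors' visualization code: the function name, the shapes, and the random projection matrices `w_q` and `w_k` are hypothetical placeholders for what would be learned U-Net weights and real text-encoder outputs.

```python
import numpy as np

def cross_attention_maps(latent, text_emb, w_q, w_k):
    """Return attention maps of shape (n_tokens, h, w).

    latent:   (h, w, c) latent feature grid (queries come from here)
    text_emb: (n_tokens, e) text-encoder outputs (keys come from here)
    w_q, w_k: projection matrices of shape (c, d) and (e, d)
    """
    h, w, c = latent.shape
    q = latent.reshape(h * w, c) @ w_q              # (h*w, d)
    k = text_emb @ w_k                              # (n_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[1])          # (h*w, n_tokens)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)       # softmax over tokens
    return probs.T.reshape(-1, h, w)                # one map per token

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
h, w, c, e, d, n_tokens = 8, 16, 32, 24, 16, 5
latent = rng.normal(size=(h, w, c))
text_emb = rng.normal(size=(n_tokens, e))
w_q = rng.normal(size=(c, d))
w_k = rng.normal(size=(e, d))

maps = cross_attention_maps(latent, text_emb, w_q, w_k)
```

Because the latent grid here stands in for a spectrogram-shaped feature map, each per-token map can be read as a heatmap over frequency and time; a well-aligned model would be expected to concentrate a sound-event token's weight on the region where that event occurs.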

Paper Structure

This paper contains 29 sections, 8 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An overview of the Auffusion architecture. The whole training and inference process includes back-and-forth transformations between four feature spaces: audio, spectrogram, pixel, and latent space. Note that the U-Net is initialized from a pretrained text-to-image LDM.
  • Figure 2: Visualization of cross-attention maps for Auffusion with different text encoders and for the Tango model. Auffusion-no-pretrain uses a fixed CLIP encoder, and its LDM is trained from scratch. The LDMs in rows 2 to 4 are initialized with SDv1.5 and use different encoders. The last row shows Tango's cross-attention map; Tango uses FlanT5-large as its condition encoder.
  • Figure 3: Screenshot of subjective evaluation.
  • Figure 4: Demo of audio generation with the Auffusion-Full model.
  • Figure 5: Audio style transfer gradually from baby crying to cat meowing.
  • ...and 5 more figures