DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Abstract

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.

Paper Structure

This paper contains 55 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An illustration of DreamAudio for audio generation with customized content of "monster fighting" and "Minion talking". The system takes both the text prompt and user-provided audio-caption pairs as the reference concepts, and generates audio content consistent with the description "Monster is fighting with a Minion".
  • Figure 2: The inference pipeline of DreamAudio. The input prompt and the reference concept are encoded in two parallel paths through the Flan-T5 Encoder, and the reference audio feature is encoded by the VAE Encoder. Together with the noisy data $\boldsymbol{z}_{\lambda}$, these features form the four inputs forwarded to the generator with the MRC structure to generate the denoised data, followed by the VAE decoder and vocoder to reconstruct the final output waveform.
  • Figure 3: The details of the MRC module, which takes the reference features $\boldsymbol{R}$ and $\boldsymbol{E}$, the prompt feature $\boldsymbol{C}$, and the current noisy data $\boldsymbol{z}_{\lambda}$ as inputs to generate the dynamics for the denoised data $\boldsymbol{z}_{1}$ at step $\lambda$. The output $\mu(\cdot)$ can then be used for both training and inference.
  • Figure 4: The generation pipeline of the customized datasets, with Customized-Concatenation on the left and Customized-Overlay on the right. All clips are selected randomly from the base dataset, and both the concatenated and the overlaid clips are fixed to 10 seconds.
  • Figure 5: The details of the MRC UNet network for reference length fine-tuning. All the existing blocks are frozen and only the introduced CNN alignment layer is trained during this stage.