PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation
Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang
TL;DR
This work tackles the limited diversity and accuracy of text descriptions in text-to-audio (TTA) generation. It introduces Portable Plug-in Prompt Refiner (PPPR), a front-end framework with two modules: LLM-Based Text Description Active Augmentation, which enriches training captions, and CoT-Based Prompt Regularization, which refines input descriptions at inference. It also incorporates an LLM-refined text encoder (FLAN-T5-LARGE) to condition a Latent Diffusion Model (LDM) for audio generation. PPPR achieves a state-of-the-art Inception Score (IS) of 8.72 on AudioCaps and improves the subjective metrics OVL and REL by leveraging fine-grained description diversity and stepwise textual regularization. Because it improves the robustness of the acoustic model without altering the acoustic training set, PPPR has practical implications for high-fidelity, context-aware audio generation in AIGC applications, with potential extensions to joint TTA-TTS tasks.
Abstract
Text-to-Audio (TTA) aims to generate audio that corresponds to a given text description and plays a crucial role in media production. The text descriptions in TTA datasets lack rich variation and diversity, so TTA model performance drops when the model is faced with complex text. To address this issue, we propose the Portable Plug-in Prompt Refiner (PPPR), which uses the rich knowledge about textual descriptions inherent in large language models to enhance the robustness of TTA acoustic models without altering the acoustic training set. Furthermore, we introduce a Chain-of-Thought (CoT) procedure that mimics human verification to make audio descriptions more accurate, thereby improving the accuracy of generated content in practical applications. Experiments show that our method achieves a state-of-the-art Inception Score (IS) of 8.72, surpassing AudioGen, AudioLDM and Tango.
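The two prompt-refinement stages described above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `augment_caption` and `regularize_prompt`, the prompt wording, the three-step breakdown of the chain of thought, and the `LLM` callable interface are hypothetical and are not taken from the paper. The audio model itself (text encoder plus LDM) is left out.

```python
from typing import Callable, List

# Hypothetical interface: any callable that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def augment_caption(caption: str, llm: LLM, n_variants: int = 3) -> List[str]:
    """Training-time stage (assumed form): ask an LLM for reworded captions.

    Returns the original caption followed by any new, non-empty variants.
    """
    variants = [caption]
    for i in range(n_variants):
        prompt = (
            f"Rewrite this audio caption with different wording (variant {i + 1}), "
            f"keeping the same sound events: {caption}"
        )
        out = llm(prompt).strip()
        if out and out not in variants:  # skip empty or duplicate rewrites
            variants.append(out)
    return variants


def regularize_prompt(user_text: str, llm: LLM) -> str:
    """Inference-time stage (assumed form): stepwise refinement of a user prompt.

    The three steps below are illustrative, not the paper's actual chain of thought.
    """
    events = llm(f"Step 1. List the sound events described in: {user_text}")
    draft = llm(f"Step 2. Write one concise audio caption covering these events: {events}")
    checked = llm(
        "Step 3. Check that the caption matches the original request and correct it if needed.\n"
        f"Request: {user_text}\nCaption: {draft}"
    )
    # Fall back to the raw user text if the LLM returns nothing usable.
    return checked.strip() or user_text
```

In a setup like this, the output of `augment_caption` would widen the set of captions paired with each training clip, while `regularize_prompt` would run once per user request before the text reaches the text encoder. Passing the LLM as a plain callable keeps the sketch independent of any particular model or API.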
