Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

Chenxu Xiong; Ruibo Fu; Shuchen Shi; Zhengqi Wen; Jianhua Tao; Tao Wang; Chenxing Li; Chunyu Qiang; Yuankun Xie; Xin Qi; Guanjun Li; Zizheng Yang

TL;DR

Text-only prompts often miss the nuanced detail needed for multi-style audio generation. The paper introduces the Sound Event Enhanced Prompt Adapter (SEPA), which fuses text with a sound-event reference via cross-attention to produce a style embedding; that embedding is applied through adaptive layer normalization in a latent diffusion audio generator. It also presents the SERST dataset to support dual-prompt style transfer. The model reports a Fréchet Distance of 26.94 and a KL divergence of 1.82, outperforming Tango, AudioLDM, and AudioGen, and its outputs align closely with the reference audio while remaining competitive in perceptual quality. The approach enables finer-grained target-style audio generation for content creation and audio synthesis, and the code, demo, and dataset are publicly available.

Abstract

Current mainstream audio generation methods rely primarily on simple text prompts, which often fail to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts a style embedding through cross-attention between text and reference audio for adaptive style control. Adaptive layer normalization is then used to enhance the model's capacity to express multiple styles. Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references. Experimental results demonstrate the robustness of the model, which achieves a state-of-the-art Fréchet Distance of 26.94 and KL Divergence of 1.82, surpassing Tango, AudioLDM, and AudioGen. Furthermore, the generated audio shows high similarity to its corresponding audio reference. The demo, code, and dataset are publicly available.
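The two mechanisms named in the abstract, cross-attention fusion and adaptive layer normalization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of text tokens as queries and audio tokens as keys/values, the mean pooling into a single style vector, the `1 + scale` parameterization, and all dimensions and weight names (`w_q`, `w_k`, `w_v`, `w_scale`, `w_shift`) are assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, audio_tokens, w_q, w_k, w_v):
    """Text tokens (queries) attend over reference-audio tokens (keys/values).
    Which modality supplies the queries is an assumption of this sketch."""
    q = text_tokens @ w_q                      # (T_text, d)
    k = audio_tokens @ w_k                     # (T_audio, d)
    v = audio_tokens @ w_v                     # (T_audio, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

def adaptive_layer_norm(h, style, w_scale, w_shift, eps=1e-5):
    """Layer-normalize h over features, then scale and shift it with
    values predicted from the style embedding."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mean) / np.sqrt(var + eps)
    gamma = 1.0 + style @ w_scale              # per-feature scale
    beta = style @ w_shift                     # per-feature shift
    return gamma * h_norm + beta

rng = np.random.default_rng(0)
d_text, d_audio, d, d_h = 8, 6, 4, 5           # hypothetical sizes
text_tokens = rng.normal(size=(3, d_text))     # stand-in text encoder output
audio_tokens = rng.normal(size=(7, d_audio))   # stand-in reference-audio features
w_q = rng.normal(size=(d_text, d))
w_k = rng.normal(size=(d_audio, d))
w_v = rng.normal(size=(d_audio, d))

fused, attn = cross_attention(text_tokens, audio_tokens, w_q, w_k, w_v)
style = fused.mean(axis=0)                     # pooled style embedding (assumption)

w_scale = 0.1 * rng.normal(size=(d, d_h))
w_shift = 0.1 * rng.normal(size=(d, d_h))
h = rng.normal(size=(10, d_h))                 # stand-in U-Net hidden activations
out = adaptive_layer_norm(h, style, w_scale, w_shift)
```

Because the scale and shift are computed from the style embedding rather than learned as fixed parameters, a different reference sound changes how every normalized activation is modulated, which is one plausible reading of "adaptive style control" in contrast to a static global style vector.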

Paper Structure (14 sections, 4 equations, 2 figures, 4 tables)

Introduction
Method
Sound Event Reference Style Transfer Dataset (SERST)
Sound Event Enhanced Prompt Adapter
Conditional Latent Diffusion Audio Generation Model
Experiments
Training Setting
Evaluation Metrics
Results and Analysis
Sensitivity Analysis for Sound Enhanced Prompt Adapter
Comparison of Generated Audio Accuracy with Baseline Models
Ablation Study of Text and Sound Event Prompt Fusion Methods
Style transfer performance evaluation by measuring audio similarity
Conclusion

Figures (2)

Figure 1: Example of Sound Event Enhanced Prompt Adapter, generating audio while preserving the style from sound event.
Figure 2: SERST Dataset and Model Architecture: In the training stage, the latent diffusion model (LDM) is conditioned on an audio embedding learned in a continuous space through a variational auto-encoder (VAE). Text is fused with a randomly selected sound event reference through the Sound Event Enhanced Prompt Adapter to generate a style embedding. This style embedding is then utilized for adaptive layer normalization in the U-Net. In the inference stage, the LDM is conditioned on random noise instead of the audio embedding derived from the VAE.
