Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang
TL;DR
This work investigates the benefits of pre-training a state-of-the-art audio generator, AudioLDM, for text-to-sound tasks, especially under data-scarce conditions. It presents a standardized benchmark across four datasets and four metrics, Fréchet distance (FD), inception score (IS), Fréchet audio distance (FAD), and Kullback-Leibler divergence (KL), to enable fair comparisons and transfer-learning assessments. Through baseline and fine-tuning studies, the paper shows that pre-trained AudioLDM can improve sample quality and training efficiency. Text embedding often offers better regularization on small datasets, while audio embedding can enable faster convergence in some cases. The resulting benchmark and findings provide practical guidance for deploying pre-trained audio generation systems in data-limited scenarios.
Abstract
Deep neural networks have recently achieved breakthroughs in sound generation. Despite their outstanding sample quality, current sound generation models face issues on small-scale datasets (e.g., overfitting), which significantly limit performance. In this paper, we make the first attempt to investigate the benefits of pre-training on sound generation, using AudioLDM, the cutting-edge model for audio generation, as the backbone. Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarce scenarios. In addition, the baselines and evaluation protocols for sound generation systems are not consistent enough to allow different studies to be compared directly. To facilitate further study on sound generation tasks, we benchmark the sound generation task on various frequently used datasets. We hope our results on transfer learning and benchmarks can provide references for further research on conditional sound generation.
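
As a rough illustration of one of the benchmark metrics named in the TL;DR, the KL divergence between two discrete distributions can be sketched as below. This is a minimal sketch, not the paper's evaluation code: it assumes each distribution is a list of class probabilities (for example, the output of an audio classifier on a target clip and on a generated clip), and the function name `kl_divergence` and the smoothing constant `eps` are hypothetical choices made for this example.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """Return KL(p || q) for two discrete distributions of equal length.

    `p` and `q` are assumed to be lists of non-negative class
    probabilities. `eps` is an illustrative smoothing constant that
    avoids log(0); it is not taken from the paper.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of classes")

    # Smooth, then renormalise so that each list still sums to 1.
    p_smooth = [x + eps for x in p]
    q_smooth = [x + eps for x in q]
    p_total = sum(p_smooth)
    q_total = sum(q_smooth)

    # Sum p_i * log(p_i / q_i) over all classes.
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p_smooth, q_smooth)
    )


# Hypothetical class probabilities for a target clip and a generated clip.
target = [0.9, 0.1]
generated = [0.5, 0.5]
score = kl_divergence(target, generated)
```

A lower value means the two distributions are closer, and identical distributions give a value of zero. How per-clip scores are aggregated over a dataset is left open here, since that depends on the evaluation protocol.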
