Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study

Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang

TL;DR

This work investigates the benefits of pre-training a state-of-the-art audio generator, AudioLDM, for text-to-sound tasks, especially under data-scarce conditions. It presents a standardized benchmark across four datasets and multiple metrics (FD, IS, FAD, KL) to enable fair comparisons and transfer-learning assessments. Through both baselines and fine-tuning studies, the paper shows that pre-trained AudioLDM can improve sample quality and training efficiency, with text embedding often offering better regularization on small datasets while audio embedding can enable faster convergence in some cases. The resulting benchmark and findings provide practical guidance for deploying pre-trained audio generation systems in data-limited scenarios.

Abstract

Deep neural networks have recently achieved breakthroughs in sound generation. Despite the outstanding sample quality, current sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting performance. In this paper, we make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM, the cutting-edge model for audio generation, as the backbone. Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios. In addition, the baselines and evaluation protocol for sound generation systems are not consistent enough to compare different studies directly. Aiming to facilitate further study on sound generation tasks, we benchmark the sound generation task on various frequently-used datasets. We hope our results on transfer learning and benchmarks can provide references for further research on conditional sound generation.

Paper Structure (11 sections, 2 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: Metric values as the proportion of pre-processed data increases from 0 to 1, for four kinds of pre-processing: 1) adding noise to the mel-spectrogram; 2) masking values in the mel-spectrogram; 3) disordering the sound events; 4) adding interfering sound events. Higher IS and lower KL, FD, and FAD indicate better sample quality.
  • Figure 2: Performance of AudioLDM on ESC50 as a function of training steps (in thousands). The four curves show AudioLDM optimized with 1) audio embeddings; 2) text embeddings; 3) text embeddings with pre-trained parameters; and 4) audio embeddings with pre-trained parameters.
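
The Figure 1 caption notes that higher IS and lower KL indicate better sample quality. The sketch below illustrates why, using the standard textbook definitions of the two quantities. It assumes that per-clip class probabilities from some audio classifier are already available. The function names `kl_divergence` and `inception_score` and the toy inputs are illustrative assumptions, not the paper's evaluation code, and the paper's exact classifier and averaging scheme are not reproduced here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small constant (an assumption of this sketch) that avoids log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def inception_score(probs):
    """IS = exp(mean over samples of KL(p(y|x) || p(y))).

    probs holds one class-probability vector per generated sample;
    p(y) is estimated as the average of those vectors.
    """
    n = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(row[k] for row in probs) / n for k in range(num_classes)]
    mean_kl = sum(kl_divergence(row, marginal) for row in probs) / n
    return math.exp(mean_kl)


# Toy inputs (hypothetical classifier outputs over two classes).
confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]

score_diverse = inception_score(confident_and_diverse)
score_flat = inception_score(uninformative)
kl_same = kl_divergence([0.2, 0.8], [0.2, 0.8])
```

On these toy inputs, confident predictions that cover both classes give a score close to 2 (the number of classes), while identical uniform predictions give a score of 1. The KL divergence between two identical distributions is 0, which is why lower KL between generated and reference predictions is read as better agreement.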