Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

Tiantian Feng, Dimitrios Dimitriadis, Shrikanth Narayanan

TL;DR

The paper investigates whether synthetic audio from generative foundation models can serve as effective training data for audio recognition and speech modeling. It compares label-guided and LLM-assisted prompts across three models (AUDIOGEN, AudioLDM2, MusicGen) and analyzes zero-shot performance, mixed-training gains, and data-centric augmentation strategies. Key findings show that LLM-assisted prompts improve zero-shot accuracy, that mixed training helps when real data is limited, and that synthetic audio can serve as augmentation data for speech-related tasks and improve robustness to noise. The results support adopting synthetic audio as a practical data source for improving recognition and robustness, with future work focusing on multimodal extensions and domain mismatch mitigation.

Abstract

Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Fréchet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using the generated audio as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio.

Paper Structure (19 sections, 5 figures, 5 tables)

This paper contains 19 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The audio generation approaches in this work: label-guided and LLM-assisted prompts. The label-guided prompt creates the prompt based on the label names, while the LLM-assisted prompt augments the label names with sound descriptions using LLMs. We use images from https://openmoji.org/
  • Figure 2: Comparison of zero-shot audio recognition at varying generation quantities. We study generation quantities of approximately 1x, 3x, and 5x the original training audio size.
  • Figure 3: Performance comparisons between real audio training and mixed training at different real data percentages.
  • Figure 4: Class-wise precision of the zero-shot audio recognition on UCF101 data.
  • Figure 5: Performance of speech training against real audio noise at 5, 10, and 20 dB with different augmentation sources.
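
The two prompting strategies named in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the function names, and the canned descriptions standing in for an LLM call are all hypothetical, chosen only to show how a label-guided prompt differs from an LLM-assisted one.

```python
from typing import Callable, Dict, List


def label_guided_prompt(label: str) -> str:
    """Build a prompt from the class label alone.

    The "Sound of ..." template is an assumption for illustration.
    """
    return f"Sound of {label.replace('_', ' ').lower()}"


def llm_assisted_prompts(
    label: str, describe: Callable[[str], List[str]]
) -> List[str]:
    """Augment the label-guided prompt with sound descriptions.

    `describe` is a hypothetical callable that would wrap an LLM and
    return short descriptions of how the labeled sound may occur.
    """
    base = label_guided_prompt(label)
    return [f"{base}, {description}" for description in describe(label)]


# Stand-in for an LLM: canned descriptions, invented for this sketch.
_CANNED: Dict[str, List[str]] = {
    "dog_bark": [
        "short and sharp, in a quiet yard",
        "repeated, echoing in a hallway",
    ],
}


def fake_describe(label: str) -> List[str]:
    """Return canned descriptions instead of querying a real LLM."""
    return _CANNED.get(label, [])


prompts = llm_assisted_prompts("dog_bark", fake_describe)
```

Each resulting string would then be passed as the text condition to an audio generation model. The LLM-assisted variant yields several distinct prompts per label, which is one plausible way to diversify the generated audio for a class.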