Synthetic training set generation using text-to-audio models for environmental sound classification

Francesca Ronchini; Luca Comanducci; Fabio Antonacci

TL;DR

The paper tackles environmental sound classification (ESC) with limited labeled data by exploring synthetic datasets generated with text-to-audio (TTA) models. It assesses three training strategies (augmentation with TTA data, mixed real and synthetic data, and training on synthetic data only) using CNN and CRNN ESC models. TTA-based augmentation consistently improves accuracy over traditional signal-processing augmentation, while training solely on synthetic audio underperforms the real-data baseline. Partial replacement of real data with synthetic samples is feasible up to about 20% (and up to about 40% for AudioGen$_{gpt}$), after which performance degrades. The results motivate further work on prompt design and fine-tuning of TTA models to close the gap between synthetic and real data.

Abstract

In recent years, text-to-audio models have revolutionized the field of automatic audio generation. This paper investigates their application in generating synthetic datasets for training data-driven models. Specifically, this study analyzes the performance of two environmental sound classification systems trained with data generated from text-to-audio models. We considered three scenarios: a) augmenting the training dataset with data generated by text-to-audio models; b) using a mixed training dataset combining real and synthetic text-driven generated data; and c) using a training dataset composed entirely of synthetic audio. In all cases, the performance of the classification models was tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, with consistent performance when replacing a subset of the recorded dataset. However, the performance of the audio recognition models drops when relying entirely on generated audio.
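The three scenarios differ only in how the training set is assembled, which the following minimal sketch tries to make concrete. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_training_set`, the `(clip_id, label)` representation, and the random per-clip replacement in the mixed case are all hypothetical choices made here for clarity.

```python
import random


def build_training_set(real, synthetic, scenario,
                       n_synth=0, replace_fraction=0.0, seed=0):
    """Assemble a training list for one of the three scenarios.

    real, synthetic: lists of (clip_id, label) pairs. This representation
    is an assumption of the sketch, not taken from the paper.
    """
    rng = random.Random(seed)

    if scenario == "augment":
        # (a) keep every real clip and add n_synth generated clips
        extra = rng.sample(synthetic, min(n_synth, len(synthetic)))
        return list(real) + extra

    if scenario == "mixed":
        # (b) swap a fraction of the real clips for generated ones,
        # keeping the overall training-set size unchanged
        n_replace = round(replace_fraction * len(real))
        n_replace = min(n_replace, len(synthetic))
        kept = rng.sample(real, len(real) - n_replace)
        swapped = rng.sample(synthetic, n_replace)
        return kept + swapped

    if scenario == "synthetic_only":
        # (c) train on generated clips only
        return rng.sample(synthetic, min(n_synth, len(synthetic)))

    raise ValueError(f"unknown scenario: {scenario!r}")
```

In all three cases the evaluation set would remain real recordings only, as stated in the abstract.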

Paper Structure (12 sections, 4 figures, 2 tables)

Introduction
Experimental procedure
Text-to-Audio models
Synthetic dataset generation process
Prompt templates
Model Architectures
Experiments and results
Can TTA-augmented datasets increase the accuracy of ESC models?
Can we rely on only TTA-generated data to train an ESC system?
To what extent can real data be safely replaced by synthetic data generated through TTA models?
Discussion
Conclusions and future works

Figures (4)

Figure 1: Class distribution of the US8K dataset in each fold. Colors represent the different sound classes, as specified in the legend.
Figure 2: Classification accuracy when varying the size of the TTA-generated augmentation dataset. Error bars represent 95% confidence intervals over $5$ runs of the experiment.
Figure 3: Classification accuracy when varying the size of the training dataset composed of only TTA-generated data. Error bars: 95% confidence intervals over $5$ experiment repetitions.
Figure 4: Classification accuracy when incrementally replacing US8K folders using TTA-generated data. Error bars: 95% confidence intervals over 5 experiment repetitions.
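
The captions report 95% confidence intervals over 5 runs but do not say how they were computed. One common choice, assumed here rather than taken from the paper, is a Student's t interval on the mean accuracy: half-width $t_{0.975,\,n-1} \cdot s / \sqrt{n}$, where $s$ is the sample standard deviation. A minimal sketch, with the hypothetical helper name `mean_ci95`:

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by number of runs n
# (degrees of freedom = n - 1). Only small n are tabulated in this sketch.
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def mean_ci95(accuracies):
    """Return (mean, half_width) of a 95% t-interval over repeated runs."""
    n = len(accuracies)
    if n not in T_CRIT_95:
        raise ValueError("this sketch only tabulates n = 2..5 runs")
    mean = statistics.mean(accuracies)
    # Standard error of the mean from the sample standard deviation
    sem = statistics.stdev(accuracies) / math.sqrt(n)
    return mean, T_CRIT_95[n] * sem
```

An error bar would then span `mean - half_width` to `mean + half_width` for each configuration.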
