Synthetic training set generation using text-to-audio models for environmental sound classification

Francesca Ronchini, Luca Comanducci, Fabio Antonacci

TL;DR

The paper tackles environmental sound classification (ESC) with limited labeled data by exploring synthetic datasets generated with text-to-audio (TTA) models. It assesses three training strategies (augmentation with TTA data, a mix of real and synthetic data, and synthetic data only) using CNN and CRNN classifiers. TTA-based augmentation consistently improves accuracy over traditional signal-processing augmentation, while training solely on synthetic audio underperforms the real-data baseline. Partial replacement of real data with synthetic samples is feasible up to about 20% (and up to about 40% for AudioGen$_{gpt}$), after which performance degrades. The results motivate further work on prompt design and fine-tuning of TTA models to close the gap between synthetic and real data.

Abstract

In recent years, text-to-audio models have revolutionized the field of automatic audio generation. This paper investigates their application in generating synthetic datasets for training data-driven models. Specifically, this study analyzes the performance of two environmental sound classification systems trained with data generated from text-to-audio models. We considered three scenarios: a) augmenting the training dataset with data generated by text-to-audio models; b) using a mixed training dataset combining real and synthetic text-driven generated data; and c) using a training dataset composed entirely of synthetic audio. In all cases, the performance of the classification models was tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, with consistent performance when replacing a subset of the recorded dataset. However, the performance of the audio recognition models drops when relying entirely on generated audio.
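
The three scenarios in the abstract differ only in how the training set is assembled. The sketch below is a minimal illustration of that composition step, not the authors' pipeline. The function name `build_training_set`, the scenario labels, and the `(clip_id, label)` tuple layout are hypothetical. The mixed case here swaps randomly chosen clips, whereas the paper replaces whole US8K folds, so treat it as a simplification.

```python
import random


def build_training_set(real, synthetic, scenario, replace_fraction=0.0, seed=0):
    """Assemble a training list of (clip_id, label) pairs (hypothetical layout).

    scenario:
      "augment"        -> all real clips plus all synthetic clips
      "mixed"          -> a fraction of real clips swapped for synthetic ones
      "synthetic_only" -> synthetic clips only
    """
    rng = random.Random(seed)
    real = list(real)
    synthetic = list(synthetic)

    if scenario == "augment":
        return real + synthetic
    if scenario == "synthetic_only":
        return synthetic
    if scenario == "mixed":
        # Number of real clips to replace; the total size stays equal to len(real).
        n_replace = round(replace_fraction * len(real))
        if n_replace > len(synthetic):
            raise ValueError("not enough synthetic clips to replace that fraction")
        kept = rng.sample(real, len(real) - n_replace)
        added = rng.sample(synthetic, n_replace)
        return kept + added
    raise ValueError(f"unknown scenario: {scenario!r}")
```

In all three cases the evaluation set stays real, as the abstract states, so only the training list changes between scenarios.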

Paper Structure (12 sections, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Class distribution of the US8K dataset in each fold. Colors represent the different sound classes, as specified in the legend.
  • Figure 2: Classification accuracy when varying the size of the TTA-generated augmentation dataset. Error bars: 95% confidence intervals over 5 experiment repetitions.
  • Figure 3: Classification accuracy when varying the size of the training dataset composed of only TTA-generated data. Error bars: 95% confidence intervals over 5 experiment repetitions.
  • Figure 4: Classification accuracy when incrementally replacing US8K folds with TTA-generated data. Error bars: 95% confidence intervals over 5 experiment repetitions.
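
The captions report 95% confidence intervals over 5 repetitions but do not say how the intervals were computed. One common choice, assumed here, is a Student-t interval on the mean accuracy. The sketch below uses only the standard library; the function name `mean_ci95_five_runs` is hypothetical, and the critical value 2.776 is the two-sided 95% t value for 4 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 runs).
T_CRIT_95_DF4 = 2.776


def mean_ci95_five_runs(accuracies):
    """Return (mean, half_width) of a t-based 95% CI for exactly 5 runs.

    Assumption: the runs are independent and a t interval is appropriate;
    the source does not state which interval the figures actually use.
    """
    if len(accuracies) != 5:
        raise ValueError("this sketch is hard-coded for exactly 5 runs")
    mean = statistics.mean(accuracies)
    # Standard error of the mean from the sample standard deviation (n - 1).
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, T_CRIT_95_DF4 * sem
```

An error bar would then span `mean - half_width` to `mean + half_width` for each configuration.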