Mind the Prompt: Prompting Strategies in Audio Generations for Improving Sound Classification

Francesca Ronchini, Ho-Hsiang Wu, Wei-Cheng Lin, Fabio Antonacci

TL;DR

The paper addresses data scarcity and privacy concerns in sound classification by using Text-To-Audio (TTA) models to generate synthetic datasets guided by varied prompt strategies. It defines three prompt strategies (Basic, Structured, Exemplar-based), uses two TTA models (AudioGen and Stable Audio Open), and evaluates on ESC50 and UrbanSound8K (US8K) with a CNN10 classifier, both when training purely on synthetic data and when using synthetic data as augmentation. Key findings show that task-specific prompts yield substantial gains over basic prompts, and that combining data from different prompts or different TTA models yields more robust improvements than simply increasing dataset size. The work provides practical guidance for synthetic data augmentation in audio and suggests future directions in fine-tuning, domain adaptation, and bias mitigation in LLM-generated captions.

Abstract

This paper investigates the design of effective prompt strategies for generating realistic datasets using Text-To-Audio (TTA) models. We also analyze different techniques for efficiently combining these datasets to enhance their utility in sound classification tasks. We apply a range of prompt strategies to two TTA models and evaluate the generated data on two sound classification datasets. Our findings reveal that task-specific prompt strategies significantly outperform basic prompt approaches in data generation. Furthermore, merging datasets generated using different TTA models enhances classification performance more effectively than merely increasing the training dataset size. Overall, our results underscore the advantages of these methods as effective data augmentation techniques using synthetic data.

Paper Structure

This paper contains 15 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Accuracy of CNN10 when trained with ESC50 (a) and US8K (b) TTA-generated datasets.
  • Figure 2: Accuracy of CNN10 when ESC50 (a) and US8K (b) TTA-generated datasets are used as a data augmentation technique.