Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
TL;DR
Synthio addresses the challenge of limited labeled data for audio classification by coupling a text-to-audio (T2A) diffusion model with preference-based alignment and language-guided data generation. A two-stage pipeline first aligns the T2A model to the small dataset using Direct Preference Optimization (DPO), then generates diverse synthetic augmentations via language-guided imagination (MixCap) and a self-reflection loop with CLAP filtering. Across ten datasets and four low-resource settings, Synthio consistently outperforms baselines by 0.1% to 39%, with strong gains in long-tailed categories and clear scalability as the augmentation factor $N$ increases. The work demonstrates that coupling distribution-aligned generation with rich caption-based prompts yields practical, scalable improvements for data-efficient audio understanding, while leaving computational cost and T2A model quality as considerations for future work.
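For orientation, the standard DPO objective as originally formulated for language models can be written as below. This is a sketch, not necessarily the exact loss used in Synthio: the symbols are assumptions introduced here ($c$ a text prompt, $x^w$ a preferred and $x^l$ a dispreferred audio sample, $p_\theta$ the model being aligned, $p_{\mathrm{ref}}$ a frozen reference copy, $\beta$ a temperature, $\sigma$ the logistic function).

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(c,\,x^w,\,x^l)}\left[\log \sigma\left(\beta \log \frac{p_\theta(x^w \mid c)}{p_{\mathrm{ref}}(x^w \mid c)} - \beta \log \frac{p_\theta(x^l \mid c)}{p_{\mathrm{ref}}(x^l \mid c)}\right)\right]
$$

Diffusion models do not expose tractable likelihoods, so diffusion variants of DPO typically approximate these log-ratios with differences in denoising error; the summary above does not specify which form is used.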
Abstract
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audio. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging: the generated data must be acoustically consistent with the underlying small-scale dataset, and it must also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate that our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
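The filtering step behind the caption refinement loop can be illustrated with a minimal sketch. It assumes that each synthetic clip and each class label has already been embedded into a shared audio-text space (as a CLAP-style model would provide) and keeps a clip only if its similarity to its intended label clears a threshold. The function names, the toy embeddings, and the threshold value are hypothetical and not taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def filter_by_label_similarity(audio_emb, label_emb, target_labels, threshold=0.5):
    """Return a boolean mask marking the synthetic clips to keep.

    audio_emb:     (n_clips, d) embeddings of generated audio (assumed precomputed)
    label_emb:     (n_labels, d) embeddings of the class-label text (assumed precomputed)
    target_labels: (n_clips,) index of the label each clip was generated for
    threshold:     hypothetical similarity cutoff, chosen here for illustration
    """
    sims = cosine_similarity(audio_emb, label_emb)  # shape (n_clips, n_labels)
    target_sims = sims[np.arange(len(target_labels)), target_labels]
    return target_sims >= threshold


# Toy embeddings standing in for real audio-text embeddings.
label_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
audio_emb = np.array([[1.0, 0.0], [0.1, 1.0], [0.9, 0.1]])
target_labels = np.array([0, 0, 1])

mask = filter_by_label_similarity(audio_emb, label_emb, target_labels)
print(mask.tolist())  # [True, False, False]
```

In a loop of this kind, the captions behind rejected clips would be sent back to the language model for revision and the audio regenerated; that feedback step is omitted here.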
