Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha

TL;DR

Synthio addresses the challenge of limited labeled data for audio classification by coupling a text-to-audio (T2A) diffusion model with preference-based alignment and language-guided data generation. A two-stage pipeline first aligns the T2A model to the small dataset using Direct Preference Optimization (DPO), then generates diverse synthetic augmentations via language-guided imagination (MixCap) and a self-reflection loop with CLAP filtering. Across ten datasets and four low-resource settings, Synthio consistently outperforms baselines by 0.1% to 39%, with strong gains in long-tailed categories and further improvement as the augmentation factor $N$ increases. The work shows that coupling distribution-aligned generation with rich caption-based prompts yields practical, scalable improvements for data-efficient audio understanding, while leaving computational cost and T2A model quality as considerations for future work.

Abstract

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
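
For orientation, the generic Direct Preference Optimization objective can be written as below. This is the standard form for a trainable model $\pi_\theta$ and a frozen reference $\pi_\text{ref}$, and it is not necessarily the exact diffusion variant used in the paper. Reading the preferred sample $y_w$ as a real audio from the small-scale dataset and the dispreferred sample $y_l$ as an audio generated for the same caption $c$ is an assumption made here for illustration; $\beta$ is a temperature hyperparameter and $\sigma$ is the logistic function.

```latex
\mathcal{L}_\text{DPO}(\theta)
  = -\,\mathbb{E}_{(c,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid c)}{\pi_\text{ref}(y_w \mid c)}
        - \beta \log \frac{\pi_\theta(y_l \mid c)}{\pi_\text{ref}(y_l \mid c)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of preferred samples relative to the reference model while lowering that of dispreferred ones, which is the sense in which the generations are pulled toward the small-scale dataset.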

Paper Structure

This paper contains 31 sections, 17 equations, 9 figures, 14 tables, 1 algorithm.

Figures (9)

  • Figure 1: Performance comparison of Synthio with other augmentation methods on down-sampled ESC-50 (100 samples). Traditional augmentation, such as SpecAug, degrades performance on small-scale datasets. Naive synthetic augmentation outperforms traditional methods significantly but plateaus with higher sample counts. Synthio further enhances performance by generating consistent and diverse synthetic data.
  • Figure 2: We propose to align the T2A model $\mathcal{T}^\theta$ with the small-scale dataset $\mathcal{D}_\text{small}$ using DPO. This helps us generate audios with acoustic characteristics aligned to those of $\mathcal{D}_\text{small}$.
  • Figure 3: Overview of our proposed Language-Guided Audio Imagination for generating diverse synthetic augmentations. Starting with the small-scale dataset, we first generate audio captions and use an LLM to extract acoustic components (Prompt 1). Using these components and audio labels, we prompt the LLM to generate new and diverse captions (Prompt 2), which are then used to prompt the aligned T2A model for audio generation. The generated audios are filtered for label consistency using CLAP, with accepted audios added to the final synthetic dataset. Rejected audios undergo caption revision (Prompt 3) through a self-reflection process, and the revised captions are used to regenerate audios, iterating this process $i$ times. Example captions are in Table \ref{tab:example_captions_classification}.
  • Figure 4: Comparison of spectral and pitch features between generated audios in $\mathcal{D}_\text{syn}$ and real audios in $\mathcal{D}_\text{small}$ (for $n = 100$). Synthio-generated audios closely replicate the features of real data, demonstrating its ability to produce augmentations that maintain consistency with the original dataset (also see FAD scores in Sec. \ref{subsubsec:fad}).
  • Figure 5: Category-wise improvement in performance with Synthio augmentations for long-tailed categories.
  • ...and 4 more figures
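
The generate, filter, and revise loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `synthesize` (the aligned T2A model), `label_score` (a CLAP-style audio-label similarity), `revise` (the LLM call for Prompt 3), the `threshold` value, and `max_revisions` (standing in for $i$) are all hypothetical names and parameters introduced here. The toy stand-ins at the bottom use plain strings in place of audio.

```python
from typing import Callable, List, Tuple, TypeVar

Audio = TypeVar("Audio")


def generate_with_self_reflection(
    captions: List[Tuple[str, str]],            # (caption, label) pairs
    synthesize: Callable[[str], Audio],         # hypothetical: caption -> audio
    label_score: Callable[[Audio, str], float], # hypothetical: audio-label similarity
    revise: Callable[[str, str], str],          # hypothetical: (caption, label) -> new caption
    threshold: float = 0.5,                     # assumed acceptance cutoff
    max_revisions: int = 2,                     # assumed number of revision rounds
) -> List[Tuple[Audio, str]]:
    """Return (audio, label) pairs that pass the label-consistency filter."""
    accepted: List[Tuple[Audio, str]] = []
    pending = list(captions)
    for round_idx in range(max_revisions + 1):
        rejected: List[Tuple[str, str]] = []
        for caption, label in pending:
            audio = synthesize(caption)
            if label_score(audio, label) >= threshold:
                accepted.append((audio, label))
            else:
                rejected.append((caption, label))
        # Stop when everything passed or the revision budget is spent.
        if not rejected or round_idx == max_revisions:
            break
        # Rewrite only the captions whose audio failed the filter.
        pending = [(revise(c, l), l) for c, l in rejected]
    return accepted


# Toy stand-ins so the sketch runs without any audio models.
def toy_synthesize(caption: str) -> str:
    return caption  # the "audio" is just the caption text


def toy_label_score(audio: str, label: str) -> float:
    return 1.0 if label in audio else 0.0


def toy_revise(caption: str, label: str) -> str:
    return f"{caption} with {label}"


demo = generate_with_self_reflection(
    [("a dog barking", "dog"), ("street noise", "siren")],
    toy_synthesize, toy_label_score, toy_revise,
)
```

Captions whose audio never passes the filter within the revision budget are simply dropped in this sketch, so the returned set can be smaller than the number of input captions.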