Improving Text-To-Audio Models with Synthetic Captions

Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

TL;DR

Scarcity of high-quality audio-caption data constrains text-to-audio (TTA) models. The authors introduce AF-AudioSet, a large synthetic-caption dataset generated by an audio-language model (Audio Flamingo chat) and filtered via CLAP similarity, to enable scalable pretraining. Pretraining Tango on AF-AudioSet yields state-of-the-art results on AudioCaps and MusicCaps, with systematic analysis of caption quality vs. data size and the benefits of mixing synthetic with real captions. This approach demonstrates that high-quality synthetic captions can substantially boost TTA and TTA-based music tasks, offering scalable data augmentation for multimodal models.
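The CLAP-similarity filtering step mentioned above can be illustrated with a minimal sketch. It assumes that each audio clip and its synthetic caption have already been embedded into a shared space (by CLAP in the paper; toy 2-D vectors stand in here), and that a caption is kept when the cosine similarity reaches a threshold `tau`. The function names, the `>=` comparison, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def filter_captions(audio_emb, text_emb, captions, tau):
    """Keep captions whose audio-text similarity is at least tau.

    Hypothetical helper: the embeddings are assumed to come from an
    audio-text model such as CLAP, which is not called here.
    """
    sims = cosine_similarity(audio_emb, text_emb)
    return [c for c, s in zip(captions, sims) if s >= tau]


# Toy stand-ins for precomputed embeddings (not real CLAP outputs).
audio_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
captions = ["a dog barks", "a dog barks near traffic", "a piano plays"]

kept_loose = filter_captions(audio_emb, text_emb, captions, tau=0.45)
kept_strict = filter_captions(audio_emb, text_emb, captions, tau=0.8)
```

Raising `tau` keeps fewer but better-aligned pairs, which is the caption quality versus data size trade-off that the threshold sweeps in the figures explore.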

Abstract

It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state of the art.

Paper Structure (15 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Distribution of sound types in AF-AudioSet.
  • Figure 2: Evaluation results on AudioCaps with different CLAP thresholds of AF-AudioSet. The model is Tango (medium) finetuned on AudioCaps. $\tau=\mathbf{0.45}$ gives the best overall results and a significant improvement over the non-pretrained model.
  • Figure 3: Evaluation results on MusicCaps with different CLAP thresholds of AF-AudioSet. The model is Tango (medium) finetuned on MusicCaps. $\tau=\mathbf{0.35}$ gives the best FD and FAD and a significant improvement over the non-pretrained model.
  • Figure 4: Evaluation results on AudioCaps with different CLAP thresholds of AF-AudioSet. The model is Tango-CLAP (medium) finetuned on AudioCaps. The results are similar to Tango in Figure 2.
  • Figure 5: Evaluation results on AudioCaps with different model sizes. The model is Tango pre-trained on AF-AudioSet with $\tau=0.45$ and finetuned on AudioCaps. The improvement by pretraining is clear across all model sizes.
  • ...and 1 more figure