
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

TL;DR

This work tackles the difficulty of following complex prompts in text-to-audio generation caused by sparse or shallow captions. It introduces Sound-VECaps, a 1.66M audio-caption dataset enhanced with visual information by fusing video captions, audio captions, and AudioSet labels via an LLM, and provides two caption variants to suit different tasks. Training diffusion-based audio generation models on Sound-VECaps yields state-of-the-art results on AudioCaps and strong performance on enriched benchmarks like AudioCaps-Enhanced, while ablations reveal both the benefits and caveats of incorporating visual content in captions. The work also demonstrates improved audio-language retrieval and temporal feature understanding, underscoring the practical impact of richer, vision-assisted captions for multimodal audio-language learning.

Abstract

Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environment information. We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks, showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online at https://yyua8222.github.io/Sound-VECaps-demo/.
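
As a rough illustration of the caption pipeline described above, the sketch below assembles an LLM prompt from the three text sources (visual caption, audio caption, tagging labels) and switches between a full-feature variant and one that asks the LLM to leave out visual-only content. The names `ClipInputs` and `build_prompt`, the prompt wording, and the example clip are all hypothetical and are not taken from the paper; the LLM call itself is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipInputs:
    """Per-clip text sources named in the abstract (field names are hypothetical)."""
    visual_caption: str   # predicted caption of the video frames
    audio_caption: str    # predicted caption of the audio track
    labels: List[str]     # tagging labels, e.g. AudioSet labels


def build_prompt(clip: ClipInputs, keep_visual: bool = True) -> str:
    """Assemble a prompt asking an LLM to fuse the three sources into one caption.

    keep_visual=True  -> "full feature" variant
    keep_visual=False -> variant that asks the LLM to drop visual-only content
    The wording is illustrative only, not the paper's actual prompt.
    """
    lines = [
        "Write one detailed caption describing the sound in this clip.",
        f"Visual caption: {clip.visual_caption}",
        f"Audio caption: {clip.audio_caption}",
        f"Labels: {', '.join(clip.labels)}",
    ]
    if not keep_visual:
        lines.append("Do not mention anything that can only be seen, not heard.")
    return "\n".join(lines)


# Made-up example clip, for illustration only.
example = ClipInputs(
    visual_caption="A man stands beside a red car on a street.",
    audio_caption="An engine starts and then a man speaks.",
    labels=["Engine", "Speech"],
)
full_prompt = build_prompt(example, keep_visual=True)
audio_only_prompt = build_prompt(example, keep_visual=False)
```

Either prompt string would then be passed to whichever LLM is used; keeping the variant choice in a single flag makes it easy to generate both caption styles from the same inputs.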

Paper Structure (16 sections, 2 figures, 7 tables)

This paper contains 16 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: The caption generation pipeline of Sound-VECaps.
  • Figure 2: The prompts used for caption generation. The content in the green sections is used for the full-feature captions, and the content in the red sections is applied to exclude any visual-only content.