YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, Lei Xie

TL;DR

YingSound addresses the challenge of generating high-quality, video-synchronized sound effects under few-shot data constraints by integrating a conditional flow matching transformer with a learnable audio-visual aggregator (AVA) and a multi-modal chain-of-thought (CoT) refinement module. The approach combines the AVA, which fuses high-resolution visual features with audio features across diffusion transformer (DiT) layers, a three-stage training strategy that enables text-, video-, or video+text-guided generation, and a CoT-based refinement pipeline that uses rewards and an expert module to produce finer audio in few-shot settings. An industry-standard video-to-audio (V2A) dataset and a data pipeline with human-in-the-loop annotation support training on real-world scenes. Experiments on the VGGSound test set show strong semantic and temporal alignment as well as high audio fidelity, indicating YingSound’s potential for Foley, gaming, and animation workflows. Overall, YingSound moves video-to-audio generation toward industrial deployment by delivering high-quality, synchronized sound effects from limited labeled data.

Abstract

Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: https://giantailab.github.io/yingsound/
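
For readers unfamiliar with conditional flow matching, the following is a minimal sketch of a generic training objective of that kind, not YingSound's actual implementation. It assumes a straight-line path between a noise sample and a data sample, with the constant velocity along that path as the regression target. The function names, the toy feature shapes, and the `predict_velocity` callable (a stand-in for the transformer conditioned on video and/or text features) are all hypothetical.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Sample (x_t, t, target) on a straight-line path from noise to data.

    x1: array of shape (batch, dim) holding clean audio latents (assumed layout).
    """
    x0 = rng.standard_normal(x1.shape)        # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))    # one time value per example
    x_t = (1.0 - t) * x0 + t * x1             # point on the path at time t
    target = x1 - x0                          # constant velocity of the path
    return x_t, t, target


def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the
    conditional transformer; cond holds the conditioning features.
    """
    x_t, t, target = cfm_training_pair(x1, rng)
    pred = predict_velocity(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.standard_normal((4, 8))     # toy "audio latents"
    video_feats = rng.standard_normal((4, 16))  # toy conditioning features
    zero_model = lambda x_t, t, cond: np.zeros_like(x_t)
    print(cfm_loss(zero_model, latents, video_feats, rng))
```

At inference time, a model trained this way would generate audio latents by integrating the predicted velocity from noise at t = 0 to t = 1; the solver and step count are design choices not specified here.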

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The data collection and processing pipeline with human-in-the-loop.
  • Figure 2: Overview of YingSound. It comprises two key components: conditional flow matching with transformers and multi-modal chain-of-thought-based audio generation.
  • Figure 3: Temporal Alignment comparison.
  • Figure 4: Application visualization results of YingSound.