ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Huadai Liu, Kaicheng Luo, Jialei Wang, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue

TL;DR

ThinkSound introduces a CoT-guided, three-stage framework for video-to-audio generation and editing that leverages fine-tuned multimodal LLMs to produce structured reasoning used to drive a unified, flow-matching audio foundation model. The approach enables semantic Foley generation, interactive object-focused refinement, and natural language instruction-based editing, all underpinned by AudioCoT, a dataset of CoT-annotated audio-visual data. Empirical results show state-of-the-art performance on standard V2A benchmarks and robust out-of-distribution generalization, with ablations confirming the critical role of CoT structure and reasoning. The work offers a path toward more controllable and semantically grounded audio synthesis for multimedia applications, while acknowledging ethical considerations and the need for diverse data and robust safeguards.

Abstract

While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As in the work of professionals in the creative industries, such generation requires sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational Foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and excels on the out-of-distribution Movie Gen Audio benchmark. The project page is available at https://ThinkSound-Project.github.io.
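
The three-stage flow described in the abstract can be sketched as a small control loop: at each stage a reasoner produces CoT text, and a single generator consumes that text together with the previous audio. This is a minimal illustration under stated assumptions, not the authors' implementation. Every name here (`AudioState`, `run_stage`, `three_stage_pipeline`, `toy_reasoner`, `toy_generator`) is hypothetical, audio is represented by placeholder strings, and the stubs stand in for the multimodal LLM and the audio foundation model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AudioState:
    """Hypothetical container: current audio plus the CoT traces that led to it."""
    audio: str
    cot_history: List[str] = field(default_factory=list)


# Hypothetical interfaces standing in for the two models named in the abstract.
Reasoner = Callable[[str, str], str]              # (stage, context) -> CoT text
Generator = Callable[[str, Optional[str]], str]   # (CoT text, previous audio) -> audio


def run_stage(stage: str, context: str, state: Optional[AudioState],
              reasoner: Reasoner, generator: Generator) -> AudioState:
    """Run one stage: produce CoT reasoning, then condition the generator on it."""
    cot = reasoner(stage, context)
    previous_audio = state.audio if state is not None else None
    audio = generator(cot, previous_audio)
    history = (state.cot_history if state is not None else []) + [cot]
    return AudioState(audio=audio, cot_history=history)


def three_stage_pipeline(video: str, click: Optional[str], instruction: Optional[str],
                         reasoner: Reasoner, generator: Generator) -> AudioState:
    """Foley generation, then optional object refinement, then optional editing."""
    state = run_stage("foley", video, None, reasoner, generator)
    if click is not None:
        state = run_stage("refine", click, state, reasoner, generator)
    if instruction is not None:
        state = run_stage("edit", instruction, state, reasoner, generator)
    return state


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_reasoner(stage: str, context: str) -> str:
    return f"[{stage}] reasoning about {context}"


def toy_generator(cot: str, previous_audio: Optional[str]) -> str:
    return cot if previous_audio is None else f"{previous_audio} | {cot}"


if __name__ == "__main__":
    result = three_stage_pipeline("car_door.mp4", "door handle", "add light rain",
                                  toy_reasoner, toy_generator)
    for step in result.cot_history:
        print(step)
```

The point of the sketch is the design choice the abstract emphasizes: the same generator is reused at every stage, and only the CoT conditioning and the carried-over audio change between stages.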

Paper Structure

This paper contains 58 sections, 5 figures, 14 tables.

Figures (5)

  • Figure 1: ThinkSound with CoT: (1) CoT-driven Foley synthesis that captures semantic and temporal details; (2) interactive object-centric refinement for control; (3) targeted editing.
  • Figure 2: Overview of our AudioCoT dataset construction pipeline.
  • Figure 3: Overview of the ThinkSound architecture. Left: our Multimodal LLM framework, where a fine-tuned VideoLLaMA 2 model generates CoT reasoning for audio generation and editing. Right: our enhanced Multimodal Transformer architecture, which features an MM-DiT backbone with dedicated pathways for processing multimodal inputs and CoT-driven conditioning to enable high-fidelity, contextually grounded audio generation.
  • Figure 4: Qualitative comparison. Left: Spectrograms for a car door movement sequence (closed → opened → closed), showing ThinkSound's precise alignment of each door sound versus the baseline's premature opening effect. Right: Spectrograms for a grassy-field pheasant scene (ambient bird calls → wing-flap chirp → ambient calls), illustrating ThinkSound's accurate detection and timing of the transient chirp compared to the baseline's omission or delay.
  • Figure 5: Multi-stream blocks: $F_v$ denotes the video features, $F_t$ the text features, $x_t$ the audio latents, and $c_g$ the global condition.