Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

Haomin Zhang, Sizhe Shan, Haoyu Wang, Zihao Chen, Xiulong Liu, Chaofan Ding, Xinhan Di

TL;DR

This work targets high-quality video-to-audio (V2A) generation for both general and professional audio by introducing Chain-of-Perform (CoP) guidance in a three-stage, multi-modal framework built on flow-matching transformers. Stage 1 trains a general V2A/T2A foundation model, Stage 2 adds piano-specific modules (Extra-DiT and Roll Predictor) for professional audio, and Stage 3 applies contrastive learning and Direct Preference Optimization to refine cross-modal alignment and stylistic fidelity. A new Piano-10h CoP dataset provides MIDI-grounded guidance for step-by-step piano audio generation. Empirical results show improvements on general audio metrics (FAD, CLIP, AV-align) and piano-specific metrics (SI-SDR, MOS, MIDI precision/recall/F1), indicating that the framework works for both broad and specialized audio generation, with better semantic and temporal alignment and controllable style.

Abstract

Creating high-quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, current state-of-the-art video-guided audio generation models often fall short of producing high-quality audio for both general and specialized use cases. To address this challenge, we introduce a multi-stage, multi-modal, end-to-end generative framework with Chain-of-Thought-like (CoT-like) guidance learning, termed Chain-of-Perform (CoP). First, we employ a transformer-based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi-stage training framework that follows step-by-step guidance to ensure the generation of high-quality sound effects. Third, we develop a CoP multi-modal dataset, guided by video, to support step-by-step sound effects generation. Evaluation results highlight the advantages of the proposed multi-stage CoP generative framework over state-of-the-art models on a variety of datasets: FAD improves from 0.79 to 0.74 (+6.33%) and CLIP from 16.12 to 17.70 (+9.80%) on VGGSound; SI-SDR from 1.98 dB to 3.35 dB (+69.19%) and MOS from 2.94 to 3.49 (+18.71%) on PianoYT-2h; and SI-SDR from 2.22 dB to 3.21 dB (+44.59%) and MOS from 3.07 to 3.42 (+11.40%) on Piano-10h.
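
The percentages quoted in the abstract are consistent with a relative change measured against the baseline value, with the sign flipped for FAD because lower is better. The short Python sketch below reproduces them under that assumption. The function name `relative_improvement` and the `lower_is_better` flag are illustrative and do not come from the paper, and the SI-SDR percentages are computed directly on the dB values, as the reported figures imply.

```python
def relative_improvement(baseline, ours, lower_is_better=False):
    """Percent improvement of `ours` over `baseline`, relative to the baseline.

    Assumption: this is the convention behind the abstract's percentages.
    """
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / abs(baseline)


# (dataset, metric, baseline, proposed, lower_is_better), values from the abstract
reported = [
    ("VGGSound", "FAD", 0.79, 0.74, True),
    ("VGGSound", "CLIP", 16.12, 17.70, False),
    ("PianoYT-2h", "SI-SDR (dB)", 1.98, 3.35, False),
    ("PianoYT-2h", "MOS", 2.94, 3.49, False),
    ("Piano-10h", "SI-SDR (dB)", 2.22, 3.21, False),
    ("Piano-10h", "MOS", 3.07, 3.42, False),
]

for dataset, metric, base, ours, lower in reported:
    gain = relative_improvement(base, ours, lower_is_better=lower)
    print(f"{dataset:10s} {metric:12s} {base:>6} -> {ours:<6} ({gain:+.2f}%)")
```

Because SI-SDR is a logarithmic quantity, a percentage change in its dB value is not the same as a percentage change in the underlying signal-to-distortion ratio; the sketch only mirrors how the abstract's numbers appear to have been derived.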

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Multi-stage training pipeline of our method.
  • Figure 2: Five views of the Piano-10h dataset supporting step-by-step generation tasks.
  • Figure 3: Mel spectrogram example for contrastive learning on the VGGSound test set.
  • Figure 4: Mel spectrogram example for the piano model on the VGGSound test set. The high similarity of the Mel spectrograms demonstrates that our piano model maintains general V2A capabilities.