Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung

TL;DR

MGAudio tackles open-domain video-to-audio generation by replacing classifier-free guidance (CFG) with a model-guided training objective and a dual-role audio-visual encoder that both conditions the generator and aligns its representations. A Flow-Based Denoising Transformer performs flow matching in a video-conditioned latent space, while Audio Model-Guidance (AMG) provides direct supervision and training stability; the dual-role encoder facilitates cross-modal alignment. On VGGSound, MGAudio achieves a state-of-the-art Fréchet Audio Distance (FAD) of $0.40$ with 131M parameters and generalizes well to UnAV-100, even when trained on as little as 10% of the data, illustrating data efficiency and robustness. The work also shows that combining AMG with CFG at inference yields the best fidelity and alignment, highlighting the practical viability of model-guided multimodal generation for open-domain audio synthesis.
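
For readers unfamiliar with the CFG baseline mentioned above, the following is a minimal sketch of how classifier-free guidance is commonly applied at inference in a flow-based sampler: the unconditional velocity is extrapolated toward the conditional one, and the result is integrated with Euler steps. The toy velocity fields, the function names (`cfg_velocity`, `euler_sample`, `toward`), and the guidance scale are illustrative assumptions, not MGAudio's code or its AMG objective.

```python
from typing import Callable, List

Vector = List[float]
Velocity = Callable[[Vector, float], Vector]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Standard CFG combination: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0: Vector, cond_fn: Velocity, uncond_fn: Velocity,
                 scale: float, steps: int = 50) -> Vector:
    """Integrate the guided velocity from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = cfg_velocity(cond_fn(x, t), uncond_fn(x, t), scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toward(target: Vector) -> Velocity:
    """Hypothetical toy field that transports any point to `target` at t = 1."""
    def field(x: Vector, t: float) -> Vector:
        return [(m - xi) / (1.0 - t) for m, xi in zip(target, x)]
    return field
```

In this toy setting, `scale = 1.0` recovers the purely conditional sample, while a larger scale pushes the sample past the conditional target and away from the unconditional one, which is the usual fidelity-versus-diversity trade-off of CFG.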

Abstract

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
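
As background for the FAD figure quoted above: a Fréchet distance between two sets of embeddings fits a Gaussian to each set and computes $\lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$. The sketch below is a simplified illustration that assumes diagonal covariances, in which case the trace term reduces to $\sum_i (\sigma_{r,i} - \sigma_{g,i})^2$. Real FAD evaluations use full covariance matrices over embeddings from a pretrained audio model; the function names and the toy inputs here are hypothetical and are not the paper's evaluation code.

```python
from math import sqrt
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def gaussian_stats(embeddings: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) variance of a set of embeddings."""
    dims = list(zip(*embeddings))  # transpose: one tuple per dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def frechet_distance_diag(real: Sequence[Sequence[float]],
                          generated: Sequence[Sequence[float]]) -> float:
    """Fréchet distance between Gaussians, ASSUMING diagonal covariances."""
    mu_r, var_r = gaussian_stats(real)
    mu_g, var_g = gaussian_stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    # equals the sum of squared differences of standard deviations.
    cov_term = sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(var_r, var_g))
    return mean_term + cov_term
```

Identical embedding sets give a distance of zero, and a pure shift of the generated set changes only the mean term, which is why a lower FAD indicates generated audio whose embedding statistics sit closer to those of real audio.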

Paper Structure

This paper contains 40 sections, 15 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: V2A on VGGSound chen2020vggsound. MGAudio attains the best FAD among video-to-audio methods with full data and 1.1M iterations, and remains competitive even with only 10% of the data and 300k iterations, highlighting its strong data efficiency.
  • Figure 2: Overview of the MGAudio framework for video-guided audio generation. This is the first use of a model-guidance method that learns the audio latent with a new objective and performs dual alignment learning. The violet arrow ($\color{violet}\rightarrow$) is training-only, the black arrow ($\color{black}\rightarrow$) is used for both training and inference, and the blue arrow ($\color{blue}\downarrow$) is inference-only. MHSA: Multi-Head Self-Attention.
  • Figure 3: Effect of Alignment Encoder in Mel-Spectrogram. The selection of alignment encoders significantly impacts the quality of generated audio in the V2A task.
  • Figure 4: Audio Distribution. MGAudio generates audio samples that align more closely with the target distribution than those of other methods. For example, samples for the classes "playing djembe" (red points) and "scuba diving" (black points) are tightly clustered around the center of the real sample distribution. Full-resolution versions are provided in the supplementary material for clearer inspection.
  • Figure 5: Effect of CFG vs. AMG. Training with AMG consistently outperforms CFG across all evaluation metrics on the video-to-audio task, highlighting the advantages of model-guided learning.
  • ...and 8 more figures