V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation

Nolan Chan; Timmy Gang; Yongqian Wang; Yuzhe Liang; Dingdong Wang

Abstract

This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio (V2A) generation models, with key adaptations that align generated audio with human preferences. Our approach comprises three core innovations: (1) AudioScore, a comprehensive human-preference-aligned scoring system for assessing the semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference pair data for DPO; and (3) a curriculum learning-empowered DPO strategy specifically tailored for flow-based generative models. Experiments on the VGGSound benchmark dataset demonstrate that Frieren and MMAudio aligned with human preferences using V2A-DPO outperform both their counterparts optimized with Denoising Diffusion Policy Optimization (DDPO) and the pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.

Paper Structure (9 sections, 5 equations, 2 figures, 2 tables)

Introduction
Method
  AudioScore
  Omni-Preference Pair Data Generation
  Curriculum Learning-Empowered DPO
Experiments
  Experimental setup
  Experimental results
Conclusion

Figures (2)

Figure 1: Illustration of the proposed V2A-DPO framework, comprising (a) AudioScore, which rates generated audio with multi-dimensional scores; (b) omni-preference pair data generation, which combines preference pairs generated automatically from AudioScore with a small number of human-annotated preference pairs; and (c) curriculum learning-empowered DPO, which gradually optimizes V2A models on the complex and simple pairs, split according to the complexity score $score_c$.
Figure 2: Illustration of the generation performance of V2A models.
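
Steps (b) and (c) of Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` and `PreferencePair` types, the single aggregate `score` field, the `min_margin` filter, and the `complexity` callable standing in for $score_c$ are all hypothetical, since this page does not specify how AudioScore dimensions are combined or how the complexity score is computed.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """One generated audio clip for a video (hypothetical structure)."""
    audio_id: str
    score: float  # assumed aggregate of the AudioScore dimensions


@dataclass
class PreferencePair:
    """A (chosen, rejected) pair for DPO training (hypothetical structure)."""
    video_id: str
    chosen: str
    rejected: str
    margin: float  # score gap between chosen and rejected


def build_pairs(video_id: str,
                candidates: List[Candidate],
                min_margin: float = 0.0) -> List[PreferencePair]:
    """Turn scored candidates into preference pairs.

    The higher-scoring clip is 'chosen'. Ties and pairs whose gap is
    below min_margin (an assumed filter) are skipped.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if a.score == b.score:
            continue
        win, lose = (a, b) if a.score > b.score else (b, a)
        margin = win.score - lose.score
        if margin >= min_margin:
            pairs.append(PreferencePair(video_id, win.audio_id,
                                        lose.audio_id, margin))
    return pairs


def split_by_complexity(
    pairs: List[PreferencePair],
    complexity: Callable[[PreferencePair], float],
    threshold: float,
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Split pairs into (simple, complex) subsets for curriculum stages.

    `complexity` is a placeholder for the paper's score_c; its real
    definition is not given here.
    """
    simple = [p for p in pairs if complexity(p) < threshold]
    hard = [p for p in pairs if complexity(p) >= threshold]
    return simple, hard
```

For example, one could call `build_pairs("vid0", candidates, min_margin=0.1)` and then `split_by_complexity(pairs, lambda p: 1 - p.margin, 0.62)`. Treating a small score gap as a sign of a harder pair is only one plausible proxy, used here for illustration.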
