Video-to-Audio Generation with Hidden Alignment
Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu
TL;DR
This work tackles the problem of generating semantically and temporally aligned audio from silent video using a diffusion-based video-to-audio framework called VTA-LDM. By conditioning a latent diffusion model on projected vision features and decoding through a pre-trained AudioVAE and vocoder, the approach achieves strong semantic coherence and partial temporal synchronization, validated by both objective metrics and human evaluations. The study provides in-depth ablations on vision encoders, auxiliary embeddings, and data augmentation, revealing that Clip4Clip-based vision features, supplemental textual/positional cues, and carefully designed data augmentation jointly enhance audio quality and video-audio alignment. The findings offer practical guidelines for designing audio-visual generation systems and highlight remaining challenges in achieving fully natural and synchronized video-derived audio, especially in complex or silent-object scenarios.
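As a rough illustration of the data flow summarized above (vision features, projection, conditioned latent denoising, VAE decoding, vocoder), the following toy sketch traces the stages in pure Python. It is not the authors' implementation: every function name, shape, and the stand-in denoiser are hypothetical, and a real system would use trained networks at each stage.

```python
import math
import random


def project(frame_feats, weight):
    """Linearly project per-frame vision features into the conditioning space.

    `weight` is a list of columns (hypothetical layout); output dim = len(weight).
    """
    return [[sum(f * w for f, w in zip(feat, col)) for col in weight]
            for feat in frame_feats]


def toy_denoiser(latent, cond):
    """Stand-in for a learned noise predictor (assumption, not the paper's model).

    It treats the frame-averaged condition as the clean target and returns
    the gap between the current latent and that target as the "noise".
    """
    pooled = [sum(frame[i] for frame in cond) / len(cond)
              for i in range(len(latent))]
    return [z - p for z, p in zip(latent, pooled)]


def sample_latent(cond, dim, steps, rng):
    """Start from Gaussian noise and iteratively remove the predicted noise."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(steps, 0, -1):
        eps = toy_denoiser(latent, cond)
        # Remove a 1/t fraction of the predicted noise; the last step removes all of it.
        latent = [z - e / t for z, e in zip(latent, eps)]
    return latent


def toy_vae_decode(latent, n_frames):
    """Stand-in for the pre-trained audio VAE decoder: latent -> mel-like frames."""
    return [list(latent) for _ in range(n_frames)]


def toy_vocoder(mel, hop):
    """Stand-in for the vocoder: each mel bin drives one sinusoid."""
    total = len(mel) * hop
    wave = []
    for n in range(total):
        frame = mel[n // hop]
        wave.append(sum(a * math.sin(2 * math.pi * (k + 1) * n / total)
                        for k, a in enumerate(frame)))
    return wave


def generate_audio(frame_feats, weight, steps=10, n_mel_frames=4, hop=8, seed=0):
    """Chain the stages: project -> denoise -> decode -> vocode."""
    rng = random.Random(seed)
    cond = project(frame_feats, weight)
    latent = sample_latent(cond, len(weight), steps, rng)
    mel = toy_vae_decode(latent, n_mel_frames)
    return toy_vocoder(mel, hop)
```

The point of the sketch is only the ordering of the stages and where the video conditioning enters; the update rule in `sample_latent` is a deliberately simplified placeholder rather than a real diffusion sampler.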
Abstract
Generating audio that is semantically and temporally aligned with a video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide critical insights into how different data augmentation methods affect the generation framework's overall capacity. We showcase possibilities for advancing synchronized audio generation from both semantic and temporal perspectives. We hope these insights will serve as a stepping stone toward developing more realistic and accurate audio-visual generation models.
