Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model
Gehui Chen, Guan'an Wang, Xiaowen Huang, Jitao Sang
TL;DR
The paper tackles automatic generation of semantically consistent sound effects (SFX) and background music (BGM) for silent videos by reframing video-to-audio generation as a text-conditioned task driven by a multimodal large language model (MLLM). The SVA framework passes a key frame from the video to the MLLM, which is prompted to produce SFX and BGM schemes; these schemes are then realized by text-to-audio models (AudioGen and MusicGen) and refined through post-processing. This approach provides a natural-language interface and avoids end-to-end training. Case studies show audio that is semantically aligned with the video and of high quality. Limitations include coarse-grained video-to-audio semantics and the lack of universal evaluation metrics and benchmarks; future work targets finer temporal synchronization and comprehensive benchmarks for broader generalization.
Abstract
Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses the power of a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then used as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as the interface. We show the satisfactory performance of SVA through a case study and discuss its limitations along with future research directions. The project page is available at https://huiz-a.github.io/audio4video.github.io/.
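
The data flow described in the abstract (key frame, then MLLM-written audio schemes, then text-to-audio models, then post-processing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AudioScheme`, `pick_key_frame`, `mix`, and `video_to_audio` are hypothetical, the middle-frame selection rule and the gain-and-clip mixing step are stand-ins for details the abstract does not specify, and the MLLM and text-to-audio models are passed in as plain callables rather than tied to any real library API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = bytes        # placeholder for an encoded video frame
Audio = List[float]  # mono samples in the range [-1.0, 1.0]


@dataclass
class AudioScheme:
    """Text prompts written by the MLLM (hypothetical container)."""
    sfx_prompt: str
    bgm_prompt: str


def pick_key_frame(frames: Sequence[Frame]) -> Frame:
    # Assumption: take the middle frame; the source does not state the rule.
    if not frames:
        raise ValueError("video has no frames")
    return frames[len(frames) // 2]


def mix(sfx: Audio, bgm: Audio, bgm_gain: float = 0.5) -> Audio:
    # Assumed post-processing: trim to the shorter track, attenuate the
    # music, sum the two tracks, and clip to the valid sample range.
    n = min(len(sfx), len(bgm))
    return [max(-1.0, min(1.0, sfx[i] + bgm_gain * bgm[i])) for i in range(n)]


def video_to_audio(
    frames: Sequence[Frame],
    describe: Callable[[Frame], AudioScheme],  # stands in for the MLLM
    text_to_sfx: Callable[[str], Audio],       # stands in for an SFX model
    text_to_bgm: Callable[[str], Audio],       # stands in for a music model
) -> Audio:
    key_frame = pick_key_frame(frames)
    scheme = describe(key_frame)           # natural language is the interface
    sfx = text_to_sfx(scheme.sfx_prompt)
    bgm = text_to_bgm(scheme.bgm_prompt)
    return mix(sfx, bgm)
```

Because the models are injected as callables, any component can be swapped without retraining the others, which mirrors the abstract's point that natural-language prompts are the only interface between the stages.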
