Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model

Gehui Chen, Guan'an Wang, Xiaowen Huang, Jitao Sang

TL;DR

The paper tackles automatic generation of semantically consistent sound effects (SFX) and background music (BGM) for silent videos by reframing video-to-audio generation as a text-conditioned task mediated by a multimodal large language model (MLLM). The SVA framework passes a key frame from the video to the MLLM, which produces an SFX and BGM scheme; text-to-audio models (AudioGen for SFX, MusicGen for BGM) then realize the scheme, and post-processing refines the output. This design uses natural language as the interface and avoids end-to-end training. Case studies show high-quality audio that is semantically aligned with the video. Limitations include coarse-grained video-to-audio semantics and the lack of universal evaluation metrics or benchmarks; future work targets finer temporal synchronization and comprehensive benchmarks for broader generalization.

Abstract

Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses the power of a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then used as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as an interface. We show the satisfactory performance of SVA through a case study and discuss the limitations along with future research directions. The project page is available at https://huiz-a.github.io/audio4video.github.io/.

Paper Structure (9 sections, 4 figures)

Figures (4)

  • Figure 1: Overview of the SVA framework. First, a key frame (highlighted in red) is randomly selected from the video frames. We then prompt Gemini Pro to generate an SFX and BGM scheme comprising two different SFX descriptions and one BGM description. These descriptions are fed into AudioGen and MusicGen, respectively, producing the corresponding SFX and BGM waveform files. Post-processing then removes and reduces noise. Finally, all multimodal data, including the video frames, the SFX waveform files, and the BGM waveform file, are merged to create the video with audio.
  • Figure 2: The template for prompting MLLM to generate video description and audio scheme. The process will be executed from top to bottom. The red-highlighted parts are placeholders, to be replaced according to the actual user input and MLLM output. Some non-critical content is omitted.
  • Figure 3: The template for personalized scheme generation. The process will be executed from top to bottom. The red-highlighted parts are placeholders, to be replaced according to the actual user input and MLLM output. Some non-critical content has been omitted.
  • Figure 4: Visualization cases of SVA. Figure 4(a) presents the outputs from the MLLM during the prompt generation phase. Figure 4(b) demonstrates the filtering and denoising procedures undertaken during the post-processing phase.
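
The data flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AudioScheme`, `pick_key_frame`, `mix`, and `generate_audio` are hypothetical, the MLLM and the text-to-audio models are passed in as plain callables rather than real Gemini Pro, AudioGen, or MusicGen calls, waveforms are modeled as lists of floats, and applying post-processing to every track and mixing by clipped summation are guesses at details the caption does not specify.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence

# Stand-in for real audio data: mono samples in [-1, 1] (an assumption).
Waveform = List[float]


@dataclass
class AudioScheme:
    """Hypothetical container for the MLLM output described in Figure 1."""
    sfx: List[str]  # two SFX descriptions
    bgm: str        # one BGM description


def pick_key_frame(frames: Sequence[Any], rng: random.Random) -> Any:
    """Randomly select a single key frame from the video frames."""
    return frames[rng.randrange(len(frames))]


def mix(tracks: Sequence[Waveform]) -> Waveform:
    """Sum tracks sample by sample, treating missing samples as silence,
    then clip the result to [-1, 1]. The mixing rule is an assumption."""
    length = max(len(track) for track in tracks)
    mixed = [0.0] * length
    for track in tracks:
        for i, sample in enumerate(track):
            mixed[i] += sample
    return [max(-1.0, min(1.0, sample)) for sample in mixed]


def generate_audio(
    frames: Sequence[Any],
    describe: Callable[[Any], AudioScheme],       # stands in for the MLLM
    text_to_sfx: Callable[[str], Waveform],       # stands in for AudioGen
    text_to_bgm: Callable[[str], Waveform],       # stands in for MusicGen
    post_process: Callable[[Waveform], Waveform], # stands in for denoising
    rng: Optional[random.Random] = None,
) -> Waveform:
    """Key frame -> audio scheme -> per-description audio -> cleanup -> mix."""
    rng = rng or random.Random()
    key_frame = pick_key_frame(frames, rng)
    scheme = describe(key_frame)
    sfx_tracks = [post_process(text_to_sfx(text)) for text in scheme.sfx]
    bgm_track = post_process(text_to_bgm(scheme.bgm))
    return mix(sfx_tracks + [bgm_track])
```

Because every model is injected as a callable, the sketch can be run with simple stubs, and swapping in a different MLLM or text-to-audio model only changes the arguments. Merging the mixed audio back into the video container is left out.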