Efficient Video to Audio Mapper with Visual Scene Detection

Mingjing Yi, Ming Li

TL;DR

This work tackles video-to-audio (V2A) generation in the presence of multiple scenes by introducing scene-aware mechanisms. It first reimplements a state-of-the-art V2A baseline with a lightweight MLP mapper and then adds a scene detector to segment videos into single-scene units, training per-scene mappings to improve fidelity and semantic relevance. On the VGGSound dataset, the proposed V2A-MLP and V2A-SceneDetector variants achieve competitive fidelity metrics and substantially higher CLIP-based relevance when scene segmentation is employed, demonstrating improved handling of multi-scene videos. The study highlights the value of scene-aware processing for cross-modal generation, while noting limitations in temporal synchronization, segment duration variability, and transition smoothness, with future work aimed at addressing these issues and broader comparisons.

Abstract

Video-to-audio (V2A) generation aims to produce corresponding audio given silent video inputs. This task is particularly challenging due to the cross-modality and sequential nature of the audio-visual features involved. Recent works have made significant progress in bridging the domain gap between video and audio, generating audio that is semantically aligned with the video content. However, a critical limitation of these approaches is their inability to effectively recognize and handle multiple scenes within a video, which often leads to suboptimal audio generation in such cases. In this paper, we first reimplement a state-of-the-art V2A model with a slightly modified lightweight architecture, achieving results that outperform the baseline. We then propose an improved V2A model that incorporates a scene detector to address the challenge of switching between multiple visual scenes. Results on VGGSound show that our model can recognize and handle multiple scenes within a video and outperforms the baseline in both fidelity and relevance.

Paper Structure (14 sections, 5 equations, 1 figure, 2 tables)


Figures (1)

  • Figure 1: Overview of our V2A model. Left: The training process of V2A-SceneDetector. We use pretrained CLIP and CLAP models for feature extraction. A scene detector identifies the scenes and the boundaries between them, which are then used to segment the audio. Right: The inference pipeline. The top shows the pipeline with scene segmentation: sharing the preprocessing used in training, we condition on the predicted CLAP embedding to generate audio via AudioLDM. The bottom shows inference without scene segmentation.
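
The segmentation step in Figure 1 can be illustrated with a minimal sketch. The source does not say which scene detector is used, so the detector below is a stand-in and not the authors' method: it places a boundary wherever the cosine similarity between consecutive frame embeddings falls below a threshold, then mean-pools each segment into one per-scene embedding that a mapper could consume. The function names, the threshold value, and the toy 2-D vectors are all hypothetical; in practice the inputs would be CLIP frame features.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_scenes(frames, threshold=0.5):
    """Return (start, end) frame-index pairs, end exclusive, one per scene.

    Assumption: a scene boundary lies between frames i-1 and i whenever
    their similarity drops below `threshold` (a hypothetical value).
    """
    if not frames:
        return []
    segments = []
    start = 0
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(frames)))
    return segments


def mean_pool(frames):
    """Average a non-empty list of vectors component-wise."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def scene_embeddings(frames, threshold=0.5):
    """One pooled embedding per detected scene."""
    return [mean_pool(frames[s:e]) for s, e in split_scenes(frames, threshold)]


# Toy stand-ins for frame features: two frames of one scene, then two of another.
toy_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_segments = split_scenes(toy_frames)
toy_pooled = scene_embeddings(toy_frames)
```

In the setup the caption describes, each pooled embedding would be mapped to a CLAP embedding and passed to AudioLDM, and the same boundaries would be used to cut the audio during training; those stages are omitted here.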