Efficient Video to Audio Mapper with Visual Scene Detection
Mingjing Yi, Ming Li
TL;DR
This work tackles video-to-audio (V2A) generation for videos that contain multiple scenes by introducing scene-aware mechanisms. It first reimplements a state-of-the-art V2A baseline with a lightweight MLP mapper, then adds a scene detector that segments videos into single-scene units and trains per-scene mappings to improve fidelity and semantic relevance. On the VGGSound dataset, the proposed V2A-MLP and V2A-SceneDetector variants achieve competitive fidelity metrics, and scene segmentation yields substantially higher CLIP-based relevance, indicating improved handling of multi-scene videos. The study highlights the value of scene-aware processing for cross-modal generation and notes limitations in temporal synchronization, segment duration variability, and transition smoothness. Future work will address these issues and include broader comparisons.
Abstract
Video-to-audio (V2A) generation aims to produce audio that corresponds to a silent video input. The task is particularly challenging because the audio-visual features involved are both cross-modal and sequential. Recent works have made significant progress in bridging the domain gap between video and audio, generating audio that is semantically aligned with the video content. However, a critical limitation of these approaches is that they cannot effectively recognize and handle multiple scenes within a video, which often leads to suboptimal audio generation in such cases. In this paper, we first reimplement a state-of-the-art V2A model with a slightly modified lightweight architecture, achieving results that outperform the baseline. We then propose an improved V2A model that incorporates a scene detector to handle switches between multiple visual scenes. Results on VGGSound show that our model can recognize and handle multiple scenes within a video and outperforms the baseline in both fidelity and relevance.
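To make the scene-segmentation idea concrete, the following is a minimal sketch of how a video might be split into single-scene units before audio generation. It is not the detector used in this work, which the abstract does not specify. It assumes each frame is already represented by a feature vector (for example, an image embedding) and places a cut wherever the cosine similarity between consecutive frames drops below a threshold. The function names `cosine_similarity` and `split_into_scenes` and the threshold value of 0.5 are hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def split_into_scenes(frame_features, threshold=0.5):
    """Split a sequence of per-frame features into scene segments.

    Returns a list of (start, end) frame-index ranges, end exclusive.
    A new scene starts wherever consecutive frames are less similar
    than `threshold` (an assumed, illustrative value).
    """
    if not frame_features:
        return []
    scenes = []
    start = 0
    for i in range(1, len(frame_features)):
        similarity = cosine_similarity(frame_features[i - 1], frame_features[i])
        if similarity < threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frame_features)))
    return scenes


# Toy example: two frames of one scene followed by two frames of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scenes = split_into_scenes(frames)
print(scenes)
```

In a pipeline of this kind, each returned segment would be passed to the video-to-audio mapper separately, so that the generated audio for one scene is not conditioned on frames from another.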
