Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol
Konstantinos Apostolidis, Jakob Abesser, Luca Cuccovillo, Vasileios Mezaris
TL;DR
The paper addresses the detection of audio-visual inconsistencies in video content by proposing a baseline multimodal scene classifier and introducing the Visual-Audio Discrepancy Detection (VADD) protocol. It leverages a joint audio-visual encoder with multiple pretrained visual and audio embeddings and self-attention, achieving state-of-the-art performance on TAU AVSC and promising results on audio-visual discrepancy detection. A key contribution is the VADD benchmark and accompanying dataset/code, enabling standardized evaluation and further research in content verification. The work lays a foundation for robust multimedia integrity checks and points toward future improvements in fusion strategies, contrastive learning, and temporal analysis.
Abstract
This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier and compare it with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we can detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising outcomes in audio-visual discrepancy detection, highlighting its potential in content verification applications.
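
The idea of applying the classifier to each modality separately and comparing the predicted scene classes can be illustrated with a minimal sketch. The abstract does not specify the decision rule, so the top-1 comparison below is an assumption, and the function names and example scores are hypothetical rather than taken from the authors' code.

```python
from typing import Sequence


def predicted_class(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring scene class."""
    if len(scores) == 0:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=lambda i: scores[i])


def has_discrepancy(visual_scores: Sequence[float],
                    audio_scores: Sequence[float]) -> bool:
    """Flag a video when the two modalities disagree on the scene class.

    Assumption: both score vectors come from the same classifier applied
    to one modality at a time, over the same ordered set of scene classes.
    The top-1 comparison is an illustrative rule, not the paper's method.
    """
    if len(visual_scores) != len(audio_scores):
        raise ValueError("both modalities must use the same class set")
    return predicted_class(visual_scores) != predicted_class(audio_scores)


# Hypothetical per-class scores for three scene classes.
visual = [0.7, 0.2, 0.1]   # visual stream favours class 0
audio = [0.1, 0.1, 0.8]    # audio stream favours class 2
print(has_discrepancy(visual, audio))
```

A top-1 comparison is the simplest possible rule; a practical system might instead compare full score distributions or apply a confidence threshold before flagging a video.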
