Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol

Konstantinos Apostolidis, Jakob Abesser, Luca Cuccovillo, Vasileios Mezaris

TL;DR

The paper addresses the detection of audio-visual inconsistencies in video content by proposing a baseline multimodal scene classifier and introducing the Visual-Audio Discrepancy Detection (VADD) protocol. The classifier uses a joint audio-visual encoder that combines multiple pretrained visual and audio embeddings with self-attention, achieving state-of-the-art performance on TAU AVSC and promising results on audio-visual discrepancy detection. A key contribution is the VADD benchmark and its accompanying dataset and code, which enable standardized evaluation and further research in content verification. The work lays a foundation for robust multimedia integrity checks and points toward future improvements in fusion strategies, contrastive learning, and temporal analysis.

Abstract

This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier and compare it with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we can detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising outcomes in audio-visual discrepancy detection, highlighting its potential in content verification applications.
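
The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes the classifier has already produced one vector of per-class scores for the visual stream and one for the audio stream, and that a discrepancy is flagged whenever the two top-scoring classes differ. The function name `detect_discrepancy`, the three placeholder labels, and the plain argmax rule are all assumptions made for illustration; the paper's actual label set and decision rule may differ.

```python
# Placeholder scene labels, for illustration only (not the paper's label set).
CLASSES = ["indoor", "street", "vehicle"]


def top_class(scores):
    """Return the index of the highest score (first index wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def detect_discrepancy(visual_scores, audio_scores, classes=CLASSES):
    """Compare the scene predicted from each modality.

    Returns (is_discrepant, visual_label, audio_label).
    Assumed rule: the modalities disagree when their top classes differ.
    """
    if len(visual_scores) != len(classes) or len(audio_scores) != len(classes):
        raise ValueError("each score vector needs one entry per class")
    v = top_class(visual_scores)
    a = top_class(audio_scores)
    return v != a, classes[v], classes[a]


if __name__ == "__main__":
    # Visual stream looks like a street, audio stream sounds like indoors.
    print(detect_discrepancy([0.1, 0.8, 0.1], [0.7, 0.2, 0.1]))
    # Both streams agree on a vehicle scene.
    print(detect_discrepancy([0.2, 0.1, 0.7], [0.1, 0.3, 0.6]))
```

A stricter variant could require a minimum confidence in both modalities before flagging a mismatch, which would reduce false alarms on ambiguous clips; that threshold is likewise not something the text above specifies.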

Paper Structure (17 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The overall procedure employed in this paper. The red blocks represent the ensemble of visual embeddings (three blank rectangles inside). The blue blocks represent the ensemble of audio embeddings (three blank rectangles inside).
  • Figure 2: The architecture of the employed audio-visual scene classifier.
  • Figure 3: Confusion matrix for our visual-audio scene classifier on the 10-class variant of the VADD dataset.
  • Figure 4: The architecture of the early self-attention (ES) variant.
  • Figure 5: The architecture of the per-modality self-attention (MS) variant.