Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu

TL;DR

The paper tackles audiovisual mismatch in vision-guided audio captioning by introducing EVACap, which uses entropy-aware cross-attention gating to dynamically regulate visual input and a batch-wise Stochastic Modality Shuffling strategy to simulate and mitigate misalignment. The approach builds on a CAV-MAE-based multimodal encoder with ViT-based visual and audio features, and achieves robustness with minimal supervision. Experiments on AudioCaps show competitive semantic accuracy and superior BLEU scores, plus roughly 6x faster inference than the baseline, which matters for practical deployment. Overall, the method improves robustness to real-world audiovisual inconsistencies while improving efficiency in long-sequence captioning tasks.

Abstract

Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
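
The two ideas in the abstract, entropy-based gating of visual cues and batch-wise audiovisual shuffling, can be illustrated with a minimal sketch. This is not the authors' implementation: the gate formula (one minus the mean normalized attention entropy), the additive fusion, the rotation-based shuffle, and every name below (`attention_entropy`, `entropy_gate`, `gated_fusion`, `shuffle_visual_in_batch`, `p_shuffle`) are assumptions made for illustration only. The sketch assumes that diffuse (high-entropy) cross-attention over visual tokens signals unreliable visual evidence.

```python
import math
import random


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def entropy_gate(attn_rows):
    """Map cross-attention rows to a visual gate in [0, 1].

    Assumed rule (hypothetical, not from the paper):
    gate = 1 - mean_entropy / max_entropy, so sharply peaked attention
    keeps the visual stream and near-uniform attention suppresses it.
    """
    num_keys = len(attn_rows[0])
    if num_keys < 2:
        return 1.0  # a single visual token carries no entropy signal
    max_entropy = math.log(num_keys)  # entropy of a uniform distribution
    mean_entropy = sum(attention_entropy(r) for r in attn_rows) / len(attn_rows)
    return max(0.0, min(1.0, 1.0 - mean_entropy / max_entropy))


def gated_fusion(audio_feat, visual_ctx, gate):
    """Assumed additive fusion: audio plus gate-scaled visual context."""
    return [a + gate * v for a, v in zip(audio_feat, visual_ctx)]


def shuffle_visual_in_batch(audio_batch, visual_batch, p_shuffle=0.5, rng=None):
    """With probability p_shuffle, pair each audio clip with another clip's video.

    The visual list is rotated by a random non-zero offset, so no item
    keeps its own video. Returns (audio, visual, mismatched_flag).
    """
    rng = rng or random.Random()
    visual = list(visual_batch)
    mismatched = False
    if len(visual) > 1 and rng.random() < p_shuffle:
        offset = rng.randrange(1, len(visual))
        visual = visual[offset:] + visual[:offset]
        mismatched = True
    return list(audio_batch), visual, mismatched


if __name__ == "__main__":
    peaked = [[0.97, 0.01, 0.01, 0.01]]   # confident attention over 4 visual tokens
    diffuse = [[0.25, 0.25, 0.25, 0.25]]  # uniform attention, maximally uncertain
    print("gate (peaked): ", round(entropy_gate(peaked), 3))
    print("gate (diffuse):", round(entropy_gate(diffuse), 3))

    audio, visual, flag = shuffle_visual_in_batch(
        ["a0", "a1", "a2"], ["v0", "v1", "v2"], p_shuffle=1.0, rng=random.Random(0)
    )
    print(list(zip(audio, visual)), "mismatched:", flag)
```

In this sketch a gate near 0 reduces the fused feature to the audio feature alone, which is the intended behavior when the video is dubbed or the sound source is off-screen. The shuffled batches would be fed to the model during training so that it sees mismatched pairs; how the paper chooses the shuffle probability and where the gate is applied is not stated in this summary.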

Paper Structure

This paper contains 17 sections, 3 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: (a) Overview of the proposed EVACap Framework. (b) Detail of the Entropy-aware Gated Fusion module.
  • Figure 2: Inference time comparison (seconds) across different numbers of input video frames. "0 Input Frames" denotes a baseline with no video input.