CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

Yunzuo Hu, Wen Li, Jing Zhang

TL;DR

CAE-AV is a novel Caption-aligned and Agreement-guided Enhancement framework for audio-visual learning. It uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment.

Abstract

Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design three lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
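
The abstract names a caption-to-modality InfoNCE objective and an entropy regularizer but does not give their formulas here. As a rough illustration only, the sketch below shows the standard one-directional InfoNCE loss between caption embeddings and modality (visual or audio) embeddings, and the Shannon entropy of a softmax over token scores. The function names, the temperature value, and the tensor shapes are assumptions made for this example, not the authors' implementation, and the summary does not say whether the entropy term is minimized or maximized.

```python
import numpy as np

def _l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def _log_softmax(logits):
    # Numerically stable log-softmax along the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def info_nce(caption_emb, modality_emb, temperature=0.07):
    """Standard InfoNCE (hypothetical helper, not the paper's exact loss).

    caption_emb, modality_emb: arrays of shape (B, D); row i of each is
    treated as a matched pair, all other rows in the batch as negatives.
    """
    c = _l2_normalize(caption_emb)
    m = _l2_normalize(modality_emb)
    logits = c @ m.T / temperature          # (B, B) cosine similarities
    log_probs = _log_softmax(logits)        # each caption vs. all modality rows
    return -np.mean(np.diag(log_probs))     # matched pairs sit on the diagonal

def selection_entropy(scores):
    """Mean Shannon entropy of softmax(scores) over tokens; scores is (B, N)."""
    log_p = _log_softmax(scores)
    return -np.mean(np.sum(np.exp(log_p) * log_p, axis=-1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    captions = rng.normal(size=(4, 16))
    aligned = captions + 0.01 * rng.normal(size=(4, 16))
    print("aligned loss:", info_nce(captions, aligned))
    print("shuffled loss:", info_nce(captions, np.roll(aligned, 1, axis=0)))
    print("entropy of uniform scores:", selection_entropy(np.zeros((2, 8))))
```

In this sketch, well-aligned caption and modality embeddings give a small InfoNCE value, while mismatched pairs give a large one; uniform token scores give the maximum entropy, log N, and sharply peaked scores give an entropy near zero.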

Paper Structure (23 sections, 25 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Two distinct audio-visual misalignment scenarios. (a) the singing continues while the camera cuts to a distant view, rendering the audio event temporarily invisible in the video. (b) a little girl is playing the violin, but the audio track is playing piano.
  • Figure 2: Method overview. CAE-AV injects CASTE and AVMoE into frozen pre-trained backbones. It constructs prompts for raw visual frames and audio to generate captions. Finally, the captions are projected into the feature space of the backbones to guide the inference of visual and audio features in CASE.
  • Figure 3: The structure of CASE and the calculation of loss.
  • Figure 4: Qualitative examples of AVMoE and our CAE-AV under the S4 setting of the AVS task. The first and second rows of each subgraph show the Audio Input and Visual Input, respectively. The third row shows the Ground Truth, while the fourth and fifth rows display the results of AVMoE and our CAE-AV in locating and outlining object shapes. The sixth row presents the visual caption and audio caption used in CAE-AV to provide semantic information to the model.
  • Figure 5: Qualitative examples of AVMoE and our CAE-AV under the MS3 setting of the AVS task.
  • ...and 3 more figures