LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing

Langyu Wang, Bingke Zhu, Yingying Chen, Jinqiao Wang

TL;DR

The paper tackles weakly supervised audio-visual video parsing when audio and visual streams are frequently misaligned. It proposes LINK, a framework built from three components—Temporal-Spatial Attention with Adaptive Modality Interaction (TSAM), Segmented Audio-Visual Semantic Similarity Loss (S-LOSS), and Pseudo Label Semantic Interaction Module (PLSIM)—to balance modality contributions and inject semantic priors from pseudo-labels via CLIP/CLAP. The training loss combines a segment-weighted, cosine-similarity guided objective with pseudo-label semantics to suppress cross-modal noise and improve uni- and multi-modal predictions. Empirical results on the LLP dataset demonstrate state-of-the-art performance and robust improvements over baselines, highlighting the practical impact of adaptive fusion and semantic guidance for non-aligned AVVP tasks.
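
As a rough illustration of the segment-level cosine similarity mentioned above, the sketch below scores how well the audio and visual features of each segment agree. It is not the paper's S-LOSS: the function names, the toy 2-dimensional features, and the mapping of similarity to a [0, 1] weight are all assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as carrying no agreement signal
    return dot / (norm_a * norm_b)


def segment_similarities(audio_segments, visual_segments):
    """One similarity score per temporal segment (segments paired by index)."""
    return [cosine_similarity(a, v) for a, v in zip(audio_segments, visual_segments)]


def alignment_weights(similarities):
    """Hypothetical weighting: rescale similarity from [-1, 1] to [0, 1]."""
    return [(s + 1.0) / 2.0 for s in similarities]


# Toy features for three one-second segments (assumed values, not real data).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]

sims = segment_similarities(audio, visual)
weights = alignment_weights(sims)
for t, (s, w) in enumerate(zip(sims, weights)):
    print(f"segment {t}: similarity={s:+.2f}, weight={w:.2f}")
```

Under this assumed scheme, segments whose audio and visual features point in similar directions receive a weight near 1, while segments whose modalities disagree receive a weight near 0, so a loss term scaled by these weights would lean less on non-aligned segments.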

Abstract

Audio-visual video parsing focuses on classifying videos through weak labels while identifying events as either visible, audible, or both, alongside their respective temporal boundaries. Many methods ignore that different modalities often lack alignment, thereby introducing extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities. Our experimental findings demonstrate that our model outperforms existing methods on the LLP dataset.

Paper Structure (11 sections, 17 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Modality non-aligned samples from LLP. Existing methods are vulnerable to non-aligned events and produce incorrect predictions.
  • Figure 2: The framework of LINK. We use temporal-spatial attention and a cross-modal interaction module to enhance feature expression, and merge the semantic information from pseudo-labels with uni-modal features. The pseudo-labels are extracted by VALOR.