Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

Ling Xing; Hongyu Qu; Rui Yan; Xiangbo Shu; Jinhui Tang

TL;DR

LoCo tackles dense, multi-event audio-visual localization by leveraging local cross-modal coherence to guide both unimodal representation learning and cross-modal fusion. It introduces Local Correspondence Feature (LCF) Modulation to emphasize modality-shared semantics in the unimodal encoders and Local Adaptive Cross-modal (LAC) Interaction to adaptively aggregate cross-modal features across multiple temporal scales. The approach employs a label-free, coherence-based contrastive objective and a window-aware, data-driven cross-modal attention mechanism, integrated within a multi-scale temporal pyramid and a multimodal decoder. Empirically, LoCo achieves state-of-the-art results on UnAV-100 and AVEL benchmarks, with notable gains in mAP and precise boundary localization, demonstrating superior robustness to varying event durations and overlapping events in long videos.

Abstract

Dense-localizing Audio-Visual Events (DAVE) aims to identify the time boundaries and categories of events that are both audible and visible in a long video, where events may co-occur and vary in duration. Complex audio-visual scenes often involve asynchrony between modalities, which makes accurate localization challenging. Existing DAVE solutions extract audio and visual features through unimodal encoders and fuse them via dense cross-modal interaction. Without cross-modal guidance, however, independent unimodal encoding struggles to emphasize the semantics shared between modalities, while dense cross-modal attention may over-attend to semantically unrelated audio-visual features. To address these problems, we present LoCo, a Locality-aware cross-modal Correspondence learning framework for DAVE. LoCo uses the local temporal continuity of audio-visual events as guidance to filter irrelevant cross-modal signals and enhance cross-modal alignment throughout both the unimodal and cross-modal encoding stages. Specifically, i) LoCo applies Local Correspondence Feature (LCF) Modulation, which encourages the unimodal encoders to focus on modality-shared semantics by modulating the agreement between audio and visual features based on local cross-modal coherence. ii) To better aggregate cross-modally relevant features, we further design Local Adaptive Cross-modal (LAC) Interaction, which dynamically adjusts attention regions in a data-driven manner. This adaptive mechanism focuses attention on local event boundaries and accommodates varying event durations. By incorporating LCF and LAC, LoCo provides solid performance gains and outperforms existing DAVE methods.
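
To make the contrast between dense and local cross-modal attention concrete, the following is a minimal sketch, not the authors' LAC implementation. It restricts each audio timestamp to visual timestamps within a fixed temporal window. The function name `local_cross_attention`, the fixed `window` parameter, and the single-head, projection-free formulation are all assumptions made for illustration; the abstract states that LAC adjusts its attention region from the data rather than using a fixed window.

```python
import numpy as np

def local_cross_attention(audio, visual, window):
    """Audio queries attend to visual keys within +/- `window` timestamps.

    audio, visual: arrays of shape (T, D). The fixed window is an
    illustrative assumption; LAC is described as adapting this region.
    """
    T, D = audio.shape
    scores = audio @ visual.T / np.sqrt(D)            # (T, T) similarity logits
    idx = np.arange(T)
    outside = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(outside, -np.inf, scores)       # mask distant timestamps
    scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ visual, weights                  # fused features, attention map
```

Setting `window` to at least `T - 1` recovers dense cross-attention, so the sketch isolates the one design choice under discussion: how far in time a query is allowed to look in the other modality.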

Paper Structure (19 sections, 12 equations, 6 figures, 8 tables)

Introduction
Related Work
Audio-Visual Event Localization
Dense-localizing Audio-Visual Events
Uni-Modal Temporal Action Detection
Method
Problem Statement
Overall Framework
Local Correspondence Feature Modulation
Local Adaptive Cross-modal Interaction
Training and Inference
Experiments
Experimental Setup
Implementation Details
Comparison with State-of-the-Arts
...and 4 more sections

Figures (6)

Figure 1: Existing DAVE methods typically extract audio and visual features using separate unimodal encoders (i.e., the unimodal encoding stage) and fuse them through dense cross-attention interaction. Such solutions suffer from two key issues. i) Independent unimodal encoding underemphasizes the semantics shared between audio and visual signals in the absence of cross-modal mutual guidance, hindering the model's ability to suppress modality-specific noise (e.g., the red dashed circles in (b)). ii) Dense cross-attention interaction over-attends to irrelevant cross-modal content (e.g., the gray dashed lines in (b)), introducing semantic confusion.
Figure 2: Overview of LoCo. Visual and audio inputs are first processed by unimodal encoders to generate initial features. LoCo then applies LCF to constrain these initial features, emphasizing modality-shared semantics. Next, the adaptive cross-modal interaction pyramid, which consists of $L_\text{c}$ LAC blocks, adjusts the cross-modal attention area based on the inputs at every pyramid level to enhance intra-event integrity, and yields a multimodal feature pyramid. Finally, the multimodal decoder identifies the categories and time boundaries of audio-visual events.
Figure 3: Qualitative results showing the effect of LCF (cf. the Local Correspondence Feature Modulation section), which increases feature discriminability. The cross-similarity matrix (CSM) is calculated between audio and visual features at different timestamps within the same video. For every video in the UnAV-100 [geng2023dense] test split, the standard deviation of its CSM is calculated, and the average of these values is denoted "Mean of std". A higher "Mean of std" suggests richer and more distinguishable representations. We show the CSMs of two randomly selected videos, one using "I3D-VGGish" features [i3dvggish] and one using "ONE-PEACE" features [onepeace]. Ground-truth event boundaries are marked with solid bounding boxes of different colors. With LCF, the audio-visual features exhibit higher cross-modal similarity within event segments, reflecting improved semantic consistency. LCF also reduces the similarity between audio and visual features outside the annotated event spans, promoting better discrimination between relevant and irrelevant segments.
Figure 4: The impact of the head number $H$ in Local Adaptive Cross-modal (LAC) Interaction on average mAP, using the "ONE-PEACE" backbone [onepeace].
Figure 5: The impact of the parameter $\alpha$ on average mAP, using "ONE-PEACE" features [onepeace].
...and 1 more figures
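
The "Mean of std" statistic described in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the authors' evaluation code: the caption does not specify the similarity measure, so cosine similarity is assumed here, and the standard deviation is assumed to be taken over all entries of each video's CSM. The function names `cross_similarity_matrix` and `mean_of_std` are hypothetical.

```python
import numpy as np

def cross_similarity_matrix(audio, visual):
    """Cosine similarity between every audio and visual timestamp.

    audio, visual: arrays of shape (T, D) from the same video.
    Returns a (T, T) matrix. Cosine similarity is an assumption.
    """
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    return a @ v.T

def mean_of_std(videos):
    """Average, over videos, of the standard deviation of each video's CSM.

    videos: iterable of (audio, visual) feature pairs, one pair per video.
    """
    stds = [cross_similarity_matrix(a, v).std() for a, v in videos]
    return float(np.mean(stds))
```

Under these assumptions, a video whose audio and visual features are equally similar at all timestamp pairs yields a standard deviation of 0, while a video whose features match only inside events yields a larger value, which is the sense in which the caption reads a higher "Mean of std" as more distinguishable representations.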
