Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, Dan Guo

TL;DR

The paper tackles dense audio-visual event localization in long, untrimmed videos by introducing CCNet, which combines two modules: Cross-Modal Consistency Collaboration (CMCC) and Multi-Temporal Granularity Collaboration (MTGC). CMCC enriches cross-modal semantics and enforces temporal consistency between the audio and visual streams, while MTGC passes information in both directions across temporal scales to capture events of diverse durations. On UnAV-100, CCNet achieves state-of-the-art mean Average Precision (mAP) across several temporal IoU (tIoU) thresholds. Ablations over modules, modalities, and event durations support the design, and qualitative analyses illustrate the temporal gating and cross-scale reasoning. The framework may also apply to related datasets such as ActivityNet1.3 and LFAV.
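Since the results are reported at tIoU thresholds, a minimal sketch of the standard temporal IoU computation may help. The function names and the `(start, end)` tuple convention are assumptions made for illustration; they are not taken from the CCNet code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length is clamped at zero for disjoint segments.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """A prediction counts as a match when its tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical segments: a predicted event and a ground-truth event.
predicted = (0.0, 10.0)
ground_truth = (5.0, 15.0)
print(temporal_iou(predicted, ground_truth))
print(is_match(predicted, ground_truth, 0.5))
```

In a full mAP evaluation, this match test would be applied per class to score-ranked predictions at each threshold, and the resulting average precisions averaged.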

Abstract

In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization. The code is available at https://github.com/zzhhfut/CCNet-AAAI2025.
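The temporal consistency-gated branch is described only at a high level here. As a rough illustration of the general idea (one modality producing a per-timestep gate that reweights the other), the following sketch uses a sigmoid gate over a linear projection. The gate form, the `weights` vector, and the function name are all assumptions for illustration and should not be read as the paper's actual design.

```python
import math


def temporal_gate(source, target, weights):
    """Reweight `target` features per timestep using a gate derived from `source`.

    source, target: lists of T feature vectors (lists of floats).
    weights: hypothetical projection vector with one entry per source feature.
    """
    gated = []
    for src_t, tgt_t in zip(source, target):
        # Scalar relevance score for this timestep, computed from the source modality.
        score = sum(w * s for w, s in zip(weights, src_t))
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid, in (0, 1)
        gated.append([gate * x for x in tgt_t])
    return gated


# Toy features for three timesteps (made-up values).
audio = [[2.0, 1.0], [0.0, 0.0], [-3.0, -1.0]]
visual = [[1.0, 1.0], [2.0, 4.0], [1.0, 1.0]]
weights = [1.0, 1.0]

# Audio-derived gates emphasise or suppress visual timesteps.
gated_visual = temporal_gate(audio, visual, weights)
print(gated_visual)
```

Swapping the `source` and `target` arguments gives the opposite direction, in which visual features gate the audio stream.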

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of the Dense Audio-Visual Event Localization (DAVEL) task. The DAVEL task requires temporally localizing the events that occur simultaneously in both audio and visual tracks of untrimmed videos. These dense events may overlap on the timeline and vary in duration. "GT" denotes the ground truth for audio-visual events, which are the intersection of audio events ("A-E") and visual events ("V-E").
  • Figure 2: The pipeline of our CCNet for the dense audio-visual event localization task.
  • Figure 3: Localization results for various durations.
  • Figure 4: Qualitative examples of dense audio-visual event localization.