ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization

Huilai Li, Yonghao Dang, Ying Xing, Yiming Wang, Jianqin Yin

TL;DR

Dense audio-visual event localization suffers from a semantic gap between modalities and limited modeling of co-occurring events. ESG-Net introduces an Early Semantic Interaction module for multi-stage cross-modal bridging and a Mixture of Dependency Experts to capture adaptive event dependencies, enabling progressive, event-focused fusion. The approach yields consistent performance improvements over state-of-the-art methods on UnAV-100 across backbones, with efficient parameter usage and clear ablation evidence of each component's value. These results suggest ESG-Net's potential to improve real-world dense event localization and pave the way for open-set, multi-modal downstream tasks.

Abstract

Dense audio-visual event localization (DAVE) aims to identify event categories and locate their temporal boundaries in untrimmed videos. Most studies apply event-related semantic constraints only to the final outputs and lack cross-modal semantic bridging in intermediate layers. This leaves a semantic gap between modalities at fusion time, making it difficult to distinguish event-related content from irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model's ability to infer concurrent events in complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our event-aware semantic guided network (ESG-Net) includes an early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applies multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixture-of-experts layers with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released at https://github.com/uchiha99999/ESG-Net.
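
The phrase "multiple serial mixture-of-experts layers with adaptive weight allocation" can be illustrated with a minimal sketch. This is not the authors' MoDE implementation: the class `SoftMoELayer`, the helper `serial_moe`, the linear experts, the softmax gate, and the residual connection are all assumptions chosen only to show the general idea of input-dependent expert weighting applied layer after layer.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SoftMoELayer:
    """Hypothetical soft mixture-of-experts layer with linear experts."""

    def __init__(self, dim, num_experts, rng):
        self.gate = random_matrix(num_experts, dim, rng)
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def gate_weights(self, feature):
        # Adaptive weight allocation: the weights depend on the input
        # feature and sum to 1 across experts.
        return softmax(matvec(self.gate, feature))

    def forward(self, feature):
        weights = self.gate_weights(feature)
        outputs = [matvec(expert, feature) for expert in self.experts]
        mixed = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(feature))
        ]
        # Residual connection (an assumption, not taken from the paper).
        return [x + m for x, m in zip(feature, mixed)]


def serial_moe(layers, feature):
    """Apply the layers one after another ("serial" stacking)."""
    for layer in layers:
        feature = layer.forward(feature)
    return feature


rng = random.Random(0)
layers = [SoftMoELayer(dim=4, num_experts=3, rng=rng) for _ in range(2)]
feature = [0.2, -0.1, 0.4, 0.0]  # toy fused audio-visual feature
fused = serial_moe(layers, feature)
```

In this sketch every expert contributes to every output; a sparse variant would keep only the top-weighted experts per input. Which variant the paper uses is not stated in the abstract.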

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (a) Audio-visual events are the intersection of visual and audio events, while significant differences exist between visual and audio content. (b) Most previous methods apply semantic constraints only to the final outputs. (c) We consider event-related semantic bridging between different modalities in intermediate layers through multi-stage fusion and semantic guidance.
  • Figure 2: Overview of ESG-Net. The original audio and video are fed into frozen encoders for feature extraction. Then, the ESI module performs cross-modal early fusion and multi-stage semantic guidance to focus on event-related content in intermediate layers. Subsequently, the MoDE module, which consists of multiple mixture-of-experts layers, extracts multi-event dependencies from the integrated features. Finally, the audio-visual features are decoded to obtain event categories and temporal boundaries.
  • Figure 3: Qualitative results on the test set. "GT" means ground truth, and "Base" denotes the baseline model.
  • Figure 4: Qualitative results of the impact of multi-stage semantic guidance. Four examples demonstrate the attention of the model to events at different stages. The upper and lower rows in each example display the attention weight matrices of three stages (Single-Modal Attention, Audio(Visual)-Driven Mixture and Cross-Modal Pyramid) for the visual and audio branches. The red boxes indicate the audio-visual events in videos.
  • Figure 5: Qualitative results of average performance on co-occurring events with different overlap rates, on videos from the test set.