Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization

Yuanyuan Jiang, Jianqin Yin, Yonghao Dang

TL;DR

The paper tackles audio-visual event (AVE) localization with a video-level semantic consistency framework that complements segment-level representations. It presents the event semantic consistency modeling (ESCM) module, which comprises a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE), together with a negative pair filter loss for the fully supervised setting and a smooth loss for the weakly supervised setting. Experiments on the AVE dataset show state-of-the-art performance in both settings, with substantial gains from video-level semantics and improved robustness to background noise. By modeling the temporal coherence of an event both across modalities and within each modality, the approach localizes and categorizes AVEs more accurately and with improved efficiency.

Abstract

Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods are limited to independently encoding and classifying each video segment in isolation from the full video (which can be regarded as the segment-level representations of events). However, they ignore the semantic consistency of the event within the same full video (which can be considered as the video-level representations of events). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module to explore video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE is proposed to obtain the event semantic information at the video level. Furthermore, ISCE takes video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative pair filter loss to encourage the network to filter out irrelevant segment pairs and a new smooth loss to further increase the gap between different categories of events in the weakly-supervised setting. We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings, thus verifying the effectiveness of our method. The code is available at https://github.com/Bravo5542/VSCG.
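
The contrast between segment-level and video-level representations can be illustrated with a small sketch. This is not the authors' CERE or ISCE implementation (that lives in the repository linked above); it is a minimal pure-Python illustration under stated assumptions: the fixed smoothing kernel stands in for learned 1D convolution weights, temporal mean pooling is an assumed aggregation step, and `guide_segments` with its `alpha` blend is a hypothetical guidance rule introduced only for this example.

```python
from typing import List

Vector = List[float]


def temporal_conv1d(segments: List[Vector], kernel: List[float]) -> List[Vector]:
    """Slide a 1D kernel along the time axis with zero padding.

    Assumption: one kernel is shared by every feature channel; a real
    model would learn separate weights.
    """
    half = len(kernel) // 2
    num_segments, dim = len(segments), len(segments[0])
    out = []
    for t in range(num_segments):
        mixed = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_segments:  # positions outside the video count as zero
                for d in range(dim):
                    mixed[d] += weight * segments[src][d]
        out.append(mixed)
    return out


def video_level_representation(segments: List[Vector], kernel: List[float]) -> Vector:
    """Aggregate all segments into one video-level vector (conv, then temporal mean)."""
    mixed = temporal_conv1d(segments, kernel)
    dim = len(mixed[0])
    return [sum(seg[d] for seg in mixed) / len(mixed) for d in range(dim)]


def guide_segments(segments: List[Vector], video_vec: Vector, alpha: float = 0.5) -> List[Vector]:
    """Hypothetical guidance rule: blend each segment with the video-level vector."""
    return [[(1 - alpha) * x + alpha * v for x, v in zip(seg, video_vec)]
            for seg in segments]


# Toy video: three segments agree on feature 0, the third segment is ambiguous.
segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
video_vec = video_level_representation(segments, [0.25, 0.5, 0.25])
print(video_vec)  # [0.625, 0.25]
print(guide_segments(segments, video_vec)[2])  # [0.3125, 0.625]
```

In this toy input the third segment disagrees with the rest of the video; blending it with the video-level vector pulls it toward the dominant semantics, which is the intuition behind using video-level event semantics as a prior for each segment.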

Paper Structure (22 sections, 21 equations, 10 figures, 6 tables)

This paper contains 22 sections, 21 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Illustrative example of the AVE localization task. The red box denotes the segment containing an AVE in which the sounding object is visible and the sound is audible. In the example, only when we both see the dog and hear it barking can we localize those segments (the 5th to 10th) as a "barking dog" AVE; the remaining segments are recognized as background. "AV Event" denotes the ground truth label. The bottom four rows are the predicted results of AVEL [tian2018audio], PSP [zhou2021positive], CMBS [xia2022cross], and our method.
  • Figure 2: Comparison of our video-level semantic guiding approach with previous segment-level encoding approaches. The semantics of the audio-visual content of the first two segments and the visual content of the last three segments are ambiguous, causing the previous method to mistake them for cat or goat. However, the video-level AVE representation has more discriminative semantics, which will help to locate the remaining segments more accurately.
  • Figure 3: The proposed video-level semantic consistency guidance network. (a) The main pipeline of our model. The joint audio-visual learning consists of two parts: segment-level feature encoding, consisting of audio-guided visual attention, an LSTM, and PSP [zhou2021positive]; and video-level semantic guiding, achieved by our proposed event semantic consistency modeling (ESCM) module. (b) Illustration of the cross-modal event representation extractor (CERE) module. We utilize 1D convolutional networks to aggregate all video segments to obtain the video-level event semantic representations. (c) Illustration of the ESCM module, which consists of CERE and ISCE (intra-modal semantic consistency enhancer). Note that the illustrated CERE modules are shared between the audio and visual modalities.
  • Figure 4: A qualitative example of AVE localization in a visually obscured scene. For this video, all ten segments contain the visual and audio signals of the "ringing church bell" event. We visualize the visual attention in the image stream. It is clear that our method produces more accurate localization and that our attended regions better overlap with the sound sources. We choose intermediate image frames for visualization as an abstract representation of the segment.
  • Figure 5: A qualitative example of AVE localization in a multi-source scene. For the visualization results on the left, the first column shows two large bells on the left and right sides of the scene; the second column shows that there are three people in the image, but the woman in the middle does not make any sound; in the third column, the water hits the upper wall of the sink and then naturally flows down to hit the lower wall, so there are two sounding points.
  • ...and 5 more figures
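
The task output described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes a video is represented as a list of per-segment labels with the string "background" marking non-event segments, and the helper name `localize_events` is hypothetical. It merges consecutive segments that share an event label into 1-based, inclusive spans.

```python
from itertools import groupby
from typing import List, Tuple

BACKGROUND = "background"  # assumed marker for segments without an AVE


def localize_events(segment_labels: List[str]) -> List[Tuple[str, int, int]]:
    """Merge runs of identical non-background labels into (event, first, last) spans."""
    spans = []
    position = 1  # 1-based index of the first segment in the current run
    for label, run in groupby(segment_labels):
        length = len(list(run))
        if label != BACKGROUND:
            spans.append((label, position, position + length - 1))
        position += length
    return spans


# Labels matching the Figure 1 example: segments 5 to 10 are "barking dog".
labels = [BACKGROUND] * 4 + ["barking dog"] * 6
print(localize_events(labels))  # [('barking dog', 5, 10)]
```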