Relevance-guided Audio Visual Fusion for Video Saliency Prediction

Li Yu, Xuanzhe Sun, Pan Gao, Moncef Gabbouj

TL;DR

AVRSP is a relevance-guided audio-visual saliency prediction network that dynamically adjusts how much of the audio features is retained, based on the semantic relevance between audio and visual elements, thereby refining their integration with visual features.

Abstract

Audio data, often synchronized with video frames, plays a crucial role in guiding the audience's visual attention. Incorporating audio information into video saliency prediction tasks can enhance the prediction of human visual behavior. However, existing audio-visual saliency prediction methods often fuse audio and visual features directly, which ignores possible inconsistency between the two modalities, such as when the audio serves as background music. To address this issue, we propose a novel relevance-guided audio-visual saliency prediction network dubbed AVRSP. Specifically, the Relevance-guided Audio-Visual feature Fusion (RAVF) module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements, thereby refining the integration process with visual features. Furthermore, the Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales. The Multi-scale Regulator Gate (MRG) transfers crucial fusion information to the visual features, thus optimizing the utilization of multi-scale visual features. Extensive experiments on six audio-visual eye movement datasets demonstrate that our AVRSP network achieves competitive performance in audio-visual saliency prediction.
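
The idea of retaining audio features in proportion to their relevance can be illustrated with a small sketch. This is not the RAVF module itself, whose exact formulation is not given in this summary; it is a minimal illustration that assumes relevance is measured by cosine similarity between one audio vector and one visual vector, and that the retention weight is a sigmoid of that similarity. The function names and the `sharpness` parameter are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_gate(audio: List[float], visual: List[float],
                   sharpness: float = 4.0) -> float:
    """Map similarity in [-1, 1] to a retention weight in (0, 1).

    Assumption: a sigmoid gate; the paper's actual gating may differ.
    """
    similarity = cosine_similarity(audio, visual)
    return 1.0 / (1.0 + math.exp(-sharpness * similarity))


def fuse(audio: List[float], visual: List[float],
         sharpness: float = 4.0) -> List[float]:
    """Add the gated audio features to the visual features element-wise."""
    gate = relevance_gate(audio, visual, sharpness)
    return [v + gate * a for a, v in zip(audio, visual)]


if __name__ == "__main__":
    visual = [1.0, 0.0]
    print(relevance_gate([1.0, 0.0], visual))   # audio aligned with the visuals
    print(relevance_gate([-1.0, 0.0], visual))  # audio unrelated or opposed
```

Under these assumptions, audio that matches the visual content is retained almost in full, while mismatched audio such as background music contributes little to the fused features.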

Paper Structure

This paper contains 17 sections, 16 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Visualization of saliency prediction results in a multimodal setting. For the video sequence on the left, which includes background music, our model minimizes the influence of the irrelevant audio and focuses on key visual elements. For the video sequence on the right, the first half features narration by a woman, during which our model prioritizes her presence. In the second half, where only bird humming is present, our model shifts its attention to the bird.
  • Figure 2: The proposed AVRSP consists of three main stages: (1) Audio and Visual Feature Extraction, where audio waveforms and frame sequences are encoded using the SoundNet and S3D models, respectively; (2) Multi-scale Feature Enhancement & Audio-Visual Feature Fusion, where the extracted features undergo dynamic fusion through the Relevance-guided Audio-Visual Fusion (RAVF) module, while the Multi-scale feature Synergy (MS) module and the Multi-scale Regulator Gate (MRG) adjust and enhance feature interplay; and (3) Saliency Prediction, where several saliency decoder blocks estimate the saliency map from the multi-scale audio-visual features.
  • Figure 3: The illustration of the Relevance-Guided Audio-Visual Fusion (RAVF) method.
  • Figure 4: Multi-scale feature Synergy and Multi-scale Regulator Gate.
  • Figure 5: Sample frames from the Coutrot1, Coutrot2, and DIEM databases with their eye-tracking data, along with the corresponding ground truth, AVRSP predictions, and saliency maps from other state-of-the-art audio-visual methods for comparison.
  • ...and 1 more figure