
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha

TL;DR

A novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions.

Abstract

Audio-visual saliency prediction can draw on complementary information from multiple modalities, but further performance gains are still hindered by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, this work proposes a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal), which formulates the prediction problem as a conditional generative task of the saliency map, using the input audio and video as the conditions. Based on the spatio-temporal audio-visual features, an additional network, Saliency-UNet, is designed to perform multi-modal attention modulation, progressively refining the noisy map toward the ground-truth saliency map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics.
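
The conditional generative formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear noise schedule, the 8x8 map size, the assumption that the conditions share the map's shape, and the `toy_denoiser` function (a hypothetical stand-in for Saliency-UNet) are all illustrative choices. It shows only the general pattern of noising a saliency map, predicting a clean map from the noisy map plus audio and video conditions, and scoring the prediction with an MSE loss.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(s0, t, alpha_bar, rng):
    """Sample a noisy map S_t from the clean map S_0 at diffusion step t."""
    eps = rng.standard_normal(s0.shape)
    return np.sqrt(alpha_bar[t]) * s0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def toy_denoiser(s_t, f_v, f_a):
    """Hypothetical placeholder for Saliency-UNet.

    Blends the noisy map with the mean of the two conditions. A real model
    would be a learned network; this only fixes the input/output contract.
    """
    return np.clip(0.5 * s_t + 0.25 * (f_v + f_a), 0.0, 1.0)

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth saliency maps."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
num_steps = 1000
alpha_bar = make_alpha_bar(num_steps)

s0 = rng.random((8, 8))    # ground-truth saliency map (toy data)
f_v = rng.random((8, 8))   # video condition (toy data, same shape by assumption)
f_a = rng.random((8, 8))   # audio condition (toy data, same shape by assumption)

t = int(rng.integers(0, num_steps))       # random diffusion step
s_t = add_noise(s0, t, alpha_bar, rng)    # noisy map S_t
s0_hat = toy_denoiser(s_t, f_v, f_a)      # predicted clean map
loss = mse_loss(s0_hat, s0)               # simple MSE objective
```

In a trained system the placeholder would be replaced by a learned network, and the loss would be minimized over many sampled steps and training examples.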

Paper Structure

This paper contains 20 sections, 10 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison of conventional audio-visual saliency prediction paradigms and our proposed diffusion-based approach. Both the localization-based and 3D convolution-based methods use tailored network structures and sophisticated loss functions to predict saliency areas. In contrast, our diffusion-based approach is a generalized audio-visual saliency prediction framework that uses a simple MSE objective function.
  • Figure 2: An overview of the proposed DiffSal framework. DiffSal first encodes spatio-temporal video features $\textbf{f}_v$ and audio features $\textbf{f}_a$ by the Video and Audio Encoders, respectively. Then the Saliency-UNet takes audio features $\textbf{f}_a$ and video features $\textbf{f}_v$ as the conditions to guide the network in generating the saliency map $\hat{S}_0$ from the noisy map $S_t$.
  • Figure 3: Visualization of the saliency results when different modalities are used. The audio-only approach can localize the sound source coming from the performer's guitar, while the video-only approach focuses on both the performer's face and the guitar.
  • Figure 4: Performance analysis of denoising steps on AVAD and DIEM datasets.
  • Figure 5: Qualitative results of our method compared with other state-of-the-art works. The examples cover challenging scenarios involving fast movement on a tennis court and multiple speakers indoors.
  • ...and 4 more figures