Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction

Li Yu; Xuanzhe Sun; Wei Zhou; Moncef Gabbouj

TL;DR

This work addresses video saliency prediction by introducing TAVDiff, a diffusion-model framework conditioned on text, audio, and visual inputs. It advances the field with two key components: SITR, which provides text-driven semantic guidance via cross-attention, and Saliency-DiT, a decoupled, multimodal denoiser that fuses conditional information through cross-attention during denoising. Empirical results across six audiovisual datasets show state-of-the-art performance, with notable gains in SIM, CC, NSS, and AUC-J metrics, and comprehensive ablations validate the design choices. The approach demonstrates the strong potential of tri-modal conditioning for robust, semantically coherent saliency prediction in complex video scenes, with practical implications for video compression and human-centered AI systems.

Abstract

Video saliency prediction is crucial for downstream applications such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers have started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide viewers' gaze to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, in this paper we analyze the visual, auditory, and textual modalities simultaneously and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To utilize text effectively, we use a large multimodal model to generate textual descriptions for video frames and introduce a saliency-oriented image-text response (SITR) mechanism to generate image-text response maps. These maps serve as conditional information that guides the model to localize the visual regions semantically related to the textual description. The auditory modality serves as a further condition, directing the model to focus on salient regions indicated by sounds. Moreover, the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level. To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving the SIM, CC, NSS, and AUC-J metrics by 1.03%, 2.35%, 2.71%, and 0.33%, respectively.
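
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how SIM, CC, and NSS are commonly defined in the saliency literature. It is not the authors' evaluation code: the function names (sim, cc, nss) and the nested-list map format are assumptions made for illustration, and AUC-J is omitted for brevity.

```python
import math


def _flatten(grid):
    """Turn a 2D map (list of rows) into a flat list of floats."""
    return [float(v) for row in grid for v in row]


def sim(pred, gt):
    """Similarity (SIM): histogram intersection of two maps,
    each first normalized to sum to 1. Range is [0, 1]."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))


def cc(pred, gt):
    """Linear correlation coefficient (CC): Pearson correlation
    between the predicted and ground-truth saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    var_p = sum((a - mp) ** 2 for a in p)
    var_g = sum((b - mg) ** 2 for b in g)
    return cov / math.sqrt(var_p * var_g)


def nss(pred, fixations):
    """Normalized scanpath saliency (NSS): mean of the z-scored
    prediction at fixated pixels (fixations is a binary map)."""
    p, f = _flatten(pred), _flatten(fixations)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((a - mean) ** 2 for a in p) / len(p))
    hits = [(a - mean) / std for a, fi in zip(p, f) if fi > 0]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Toy 2x2 maps, purely illustrative values.
    pred = [[0.1, 0.2], [0.3, 0.4]]
    gt = [[0.2, 0.2], [0.2, 0.4]]
    fix = [[0, 0], [0, 1]]
    print("SIM:", sim(pred, gt))
    print("CC: ", cc(pred, gt))
    print("NSS:", nss(pred, fix))
```

Higher values are better for all three. SIM and CC compare the prediction against a continuous ground-truth saliency map, whereas NSS scores it against discrete fixation locations.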
