DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction

Kiana Hooshanfar, Alireza Hosseini, Ahmad Kalhor, Babak Nadjar Araabi

TL;DR

DTFSal tackles audio-visual video saliency prediction by jointly learning multi-scale visual representations and audio cues with three novel modules: LTEB, DLTFB, and AMFB. It uses a multi-scale visual encoder (MViTv2), an audio branch (VGGish), and a hierarchical multi-decoder to produce precise saliency maps, achieving state-of-the-art performance on six audio-visual benchmarks while maintaining computational efficiency. Ablation studies validate that LTEB sharpens salient tokens, DLTFB improves long-range and detailed spatial integration, and AMFB enables robust, multi-stream fusion of modalities. The results demonstrate strong cross-modal alignment benefits and practical applicability for real-time or resource-constrained scenarios.

Abstract

Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of both visual and auditory information. Although visual-only approaches have advanced significantly, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DTFSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. The visual and audio features are then integrated by our Adaptive Multimodal Fusion Block (AMFB), which employs local, global, and adaptive fusion streams for precise cross-modal fusion. The resulting fused features are processed by a hierarchical multi-decoder structure, producing accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DTFSal achieves state-of-the-art (SOTA) performance while maintaining computational efficiency.
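The two encoder modules are described above only at a high level, so the following is a minimal NumPy sketch of the general ideas rather than the authors' implementation. It assumes that "adaptively weighting tokens" can be illustrated as scaling each token by a sigmoid score from a learned projection, and that the "shifting operation" can be illustrated as circularly shifting four channel groups in four spatial directions before averaging with the unshifted features. The function names `weight_tokens` and `shift_and_merge`, the four-group split, and the averaging step are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def weight_tokens(tokens, w, b=0.0):
    """Hypothetical LTEB-style step: scale each token by a score in (0, 1).

    tokens: (N, C) array of N tokens with C channels.
    w:      (C,) projection vector standing in for learned parameters.
    """
    scores = sigmoid(tokens @ w + b)            # one score per token, shape (N,)
    return tokens * scores[:, None], scores


def shift_and_merge(feat, shift=1):
    """Hypothetical DLTFB-style step: shift channel groups, then average.

    feat: (H, W, C) feature map. The first four groups of C // 4 channels are
    shifted down, up, right, and left. np.roll wraps around at the borders,
    which is a simplification chosen for this sketch.
    """
    g = feat.shape[2] // 4
    moves = [(shift, 0), (-shift, 0), (shift, 1), (-shift, 1)]  # (offset, axis)
    shifted = feat.copy()
    for i, (offset, axis) in enumerate(moves):
        sl = slice(i * g, (i + 1) * g)
        shifted[:, :, sl] = np.roll(feat[:, :, sl], offset, axis=axis)
    # Merge shifted and original features so each position sees its neighbours.
    return 0.5 * (feat + shifted)


# Usage on random data: weight the tokens, then shift and merge the map.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
weighted, scores = weight_tokens(feat.reshape(-1, 16), rng.standard_normal(16))
merged = shift_and_merge(weighted.reshape(8, 8, 16))
```

In this sketch the token weighting is applied before the shift, so tokens with low scores contribute less to their neighbours after merging; the ordering of the two blocks in the actual model is not stated in the abstract.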

Paper Structure

This paper contains 19 sections, 19 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Overview of our DTFSal model, which integrates a multi-scale encoder, a hierarchical multi-decoder, LTEB, DLTFB, and AMFB for efficient and accurate audio-visual saliency prediction.
  • Figure 2: Overview of LTEB and DLTFB: (a) LTEB emphasizes salient cues by adaptively weighting tokens. (b) DLTFB enhances spatial feature integration by shifting and fusing tokens.
  • Figure 3: Overview of AMFB. (a) AMFB integrates local, global, and adaptive streams to fuse audiovisual cues. (b) Tri-Stream Score module computes attention weights via GAP, a linear layer, and sigmoid activation.
  • Figure 4: Visual comparison of our DTFSal model with previous SOTA audio-visual saliency prediction methods.
  • Figure 5: Additional visual comparisons of our DTFSal model with previous SOTA audio-visual saliency prediction methods.
  • ...and 2 more figures
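
The Figure 3 caption describes the Tri-Stream Score module as global average pooling (GAP), a linear layer, and a sigmoid. A rough NumPy sketch of that pipeline follows. What the caption does not state, and what is assumed here purely for illustration, is that the scores are computed from the sum of the three streams, that the linear layer emits one score per stream, and that the fused output is the score-weighted sum of the streams. The names `tri_stream_score` and `fuse_streams` are hypothetical.

```python
import numpy as np


def tri_stream_score(feat, weight, bias):
    """GAP -> linear -> sigmoid, returning one attention weight per stream.

    feat:   (C, H, W) feature map.
    weight: (3, C) linear-layer weights (assumed: one output per stream).
    bias:   (3,) linear-layer bias.
    """
    pooled = feat.mean(axis=(1, 2))             # global average pooling, (C,)
    logits = weight @ pooled + bias             # linear layer, (3,)
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid, each value in (0, 1)


def fuse_streams(local, global_, adaptive, weight, bias):
    """Assumed fusion rule: score-weighted sum of the three streams."""
    scores = tri_stream_score(local + global_ + adaptive, weight, bias)
    fused = scores[0] * local + scores[1] * global_ + scores[2] * adaptive
    return fused, scores


# Usage on random data with C=4 channels and an 8x8 spatial grid.
rng = np.random.default_rng(0)
streams = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
fused, scores = fuse_streams(*streams, rng.standard_normal((3, 4)), np.zeros(3))
```

Because each score passes through its own sigmoid rather than a softmax, the three weights do not have to sum to one, so the streams are gated independently rather than traded off against each other.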