CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Yolo Yunlong Tang, Gen Zhan, Li Yang, Yiting Liao, Chenliang Xu

TL;DR

This paper tackles video saliency prediction by leveraging language-driven reasoning to produce a salient-object ranking that guides saliency map decoding. The authors propose CaRDiff, a framework that combines an MLLM with VSOR-CoT, a grounding module, and a diffusion model, trained in three stages to align modalities, tune chain-of-thought reasoning, and learn diffusion-based saliency decoding. The method yields state-of-the-art performance on the MVS dataset and strong zero-shot generalization to DHF1k, with ablations confirming the importance of ranking maps and VSOR-CoT. By encoding ranking information into ranking maps and conditioning diffusion on these semantic cues, CaRDiff demonstrates the practical value of high-level semantics in perceptual saliency prediction.

Abstract

Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction. Specifically, we introduce a novel prompting method VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that can be sufficiently leveraged by the diffusion model to decode the saliency maps for the given video accurately. Extensive experiments show the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.
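
The abstract says that ranked salient objects and their positions are turned into ranking maps that condition the diffusion model, but this excerpt does not specify the encoding. The following is a minimal sketch of one plausible encoding, not the authors' implementation: it assumes objects arrive as bounding boxes ordered from most to least salient, assigns a linearly decreasing intensity by rank, and keeps the higher value where boxes overlap. The function name `ranking_map`, the box format, and the intensity scheme are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def ranking_map(height: int, width: int, ranked_boxes: List[Box]) -> List[List[float]]:
    """Rasterize ranked boxes into one map; ranked_boxes[0] is the most salient."""
    n = len(ranked_boxes)
    grid = [[0.0] * width for _ in range(height)]
    for rank, (x0, y0, x1, y1) in enumerate(ranked_boxes):
        # Assumed encoding: 1.0 for the top-ranked object, down to 1/n for the last.
        value = (n - rank) / n
        # Clip the box to the frame so out-of-bounds boxes are handled safely.
        for y in range(max(0, y0), min(height, y1)):
            row = grid[y]
            for x in range(max(0, x0), min(width, x1)):
                # Where boxes overlap, keep the more salient (larger) value.
                row[x] = max(row[x], value)
    return grid


# Two hypothetical objects, most salient first, on a 4x6 frame.
boxes = [(1, 1, 3, 3), (2, 2, 5, 4)]
rmap = ranking_map(4, 6, boxes)
```

In a pipeline like the one described, a map of this kind would be computed per frame and passed to the saliency decoder as a conditioning input alongside the video frames; how CaRDiff actually injects it is not stated in this excerpt.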

Paper Structure (39 sections, 15 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Given (a) the input video, CaRDiff generates (b) video captions and (c) a salient object ranking via VSOR-CoT. These produce (d) ranking maps that guide the diffusion model, resulting in (e) saliency predictions that are accurate relative to (f) the ground-truth saliency maps.
  • Figure 2: The pipeline of data curation.
  • Figure 3: The proposed CaRDiff consists of an MLLM with VSOR-CoT, a grounding module, and a diffusion model.
  • Figure 4: Results visualization. CaRDiff shows advantages over multiple state-of-the-art models, especially in videos with rich content and complex scenarios. More visualized results can be found in the Appendix.
  • Figure 5: (a) Ranking map ratio experiments. (b) Map correlation and (c) rank correlation between the predicted ranking maps and saliency maps in CaRDiff, both illustrating a high ranking-saliency correlation.
  • ...and 3 more figures