TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos

Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

TL;DR

This work addresses the need for text-guided saliency detection in 360-degree videos by introducing TSV360, a dataset of roughly 16,000 triplets built from diverse 360-degree content, each pairing an equirectangular projection (ERP) frame with a descriptive text and a ground-truth saliency map. It then proposes TSalV360, a multimodal extension of SalViT360 that combines a vision-language backbone, a Similarity Estimation module (SimEst) that weights visual features by their relevance to the text, and a viewport spatio-temporal cross-attention (VSTCA) mechanism that fuses visual and textual data across tangent viewports. Experiments on TSV360, including 5-fold cross-validation and ablation studies, show that TSalV360 outperforms the visual-only baseline and that each architectural component (VSTCA, CLIP-based representations, SimEst, and hierarchical skips) contributes to improved text-conditioned saliency. The method supports customized, text-driven navigation and analysis of immersive 360-degree content, with implications for VR experiences and targeted content consumption.

Abstract

In this paper, we deal with the task of text-driven saliency detection in 360-degree videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection, and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its ability to perform customized text-driven saliency detection in 360-degree videos.
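
To make the idea of a similarity estimation module more concrete, the following is a minimal sketch of one way to weight per-viewport visual features by their relevance to a text embedding. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `relevance_weights`, `weight_features`), the use of cosine similarity followed by a softmax, and the `temperature` parameter are all hypothetical choices, and the toy vectors stand in for embeddings that a vision-language model would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevance_weights(viewport_feats, text_emb, temperature=1.0):
    """Softmax over text-to-viewport cosine similarities (assumed weighting scheme)."""
    sims = [cosine(f, text_emb) / temperature for f in viewport_feats]
    max_sim = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_sim) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def weight_features(viewport_feats, text_emb, temperature=1.0):
    """Scale each viewport feature vector by its relevance weight."""
    weights = relevance_weights(viewport_feats, text_emb, temperature)
    return [[w * x for x in f] for w, f in zip(weights, viewport_feats)]


# Toy data: three viewport features and one text embedding (hypothetical values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]
weights = relevance_weights(feats, text)
weighted = weight_features(feats, text)
```

In this toy example the first viewport points in the same direction as the text embedding, so it receives the largest weight. A lower `temperature` would sharpen the distribution towards the best-matching viewport.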

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: An overview of the proposed TSalV360 method (a), along with detailed presentations of the introduced VSTCA mechanism (b) and the implemented cross-attention (c) within it.
  • Figure 2: An overview of the methodology used to produce the TSV360 dataset.
  • Figure 3: Qualitative comparisons between the ground truth and the predicted saliency map generated by TSalV360 in an indoor scene.