TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos
Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
TL;DR
This work addresses the need for text-guided saliency detection in 360-degree videos by introducing TSV360, a dataset of roughly 16,000 triplets (equirectangular projection (ERP) frames, descriptive text, and ground-truth saliency maps) built from diverse 360-degree content. It then proposes TSalV360, a multimodal extension of SalViT360 that combines a vision-language backbone, a similarity estimation module (SimEst) that weights visual features by their relevance to the text, and a viewport spatio-temporal cross-attention (VSTCA) mechanism that fuses visual and textual data across tangent viewports. Experiments on TSV360, including 5-fold cross-validation and ablation studies, show that TSalV360 outperforms the visual-only baseline and that each architectural component (VSTCA, CLIP-based representations, SimEst, and hierarchical skips) contributes to improved text-conditioned saliency. The method enables customized, text-driven navigation and analysis of immersive 360-degree content, with implications for VR experiences and targeted content consumption.
Abstract
In this paper, we address the task of text-driven saliency detection in 360-degree videos. To this end, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations on the TSV360 dataset showed the competitiveness of TSalV360 against a SOTA visual-based approach and documented its ability to perform customized text-driven saliency detection in 360-degree videos.
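
To make the idea of weighting visual features by text relevance more concrete, the following is a minimal sketch and not the TSalV360 implementation. It assumes that each tangent viewport and the text description have already been embedded into the same vector space (for example by a vision-language model), and it uses a softmax over cosine similarities as the weighting rule. The function names, the `temperature` parameter, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def text_relevance_weights(viewport_features, text_embedding, temperature=0.1):
    """Softmax over viewport-to-text cosine similarities (assumed weighting rule)."""
    sims = [cosine_similarity(f, text_embedding) for f in viewport_features]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def weight_features(viewport_features, weights):
    """Scale each viewport feature vector by its text-relevance weight."""
    return [[w * x for x in f] for f, w in zip(viewport_features, weights)]


# Toy embeddings: three viewports and one text query (hypothetical values).
viewports = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
text = [1.0, 0.1, 0.0]

weights = text_relevance_weights(viewports, text)
weighted = weight_features(viewports, weights)
```

In this toy setting the first viewport is the closest match to the text embedding, so it receives the largest weight and dominates the weighted features; a lower `temperature` sharpens this preference, while a higher one spreads the weight more evenly across viewports.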
