How is Visual Attention Influenced by Text Guidance? Database and Model

Yinan Sun, Xiongkuo Min, Huiyu Duan, Guangtao Zhai

TL;DR

This work tackles how text descriptions influence visual attention by building the SJTU-TIS text-guided saliency database and proposing TGSal, a multimodal saliency predictor that fuses image features from ResNet-50 with text features from a CLIP encoder. The model employs global and local text feature fusion plus hierarchical refinement to generate saliency maps under both pure-image and text-guided conditions, and it outperforms state-of-the-art unimodal and multimodal baselines on SALICON and the new SJTU-TIS benchmarks. Key contributions include (i) the first text-guided image saliency database with ground-truth eye-tracking data for multiple text conditions, (ii) a robust TGSal architecture with dedicated global/local fusion modules and a tailored loss combining $CC$ and $MSE$ terms, and (iii) extensive ablations and cross-dataset analyses demonstrating the practical value of text–image fusion for saliency prediction. The results suggest strong potential for text-conditioned visual analytics and multimodal saliency applications, with code and data to be released for reproducibility and further research.
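As a rough illustration of how a loss combining $CC$ and $MSE$ terms could be formed, the sketch below computes the Pearson linear correlation coefficient and the mean squared error between a predicted and a ground-truth saliency map, then combines them. The summary does not give the exact formulation, so the sign convention, the weights `w_cc` and `w_mse`, and all function names are assumptions for illustration, not the authors' implementation. Maps are passed as flattened lists of pixel values.

```python
import math


def pearson_cc(pred, target):
    """Pearson linear correlation coefficient (CC) between two flattened maps."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    # A constant map has zero variance; return 0.0 instead of dividing by zero.
    return cov / denom if denom > 0 else 0.0


def mse(pred, target):
    """Mean squared error between two flattened maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pred, target, w_cc=1.0, w_mse=1.0):
    """Hypothetical CC + MSE loss (weights are assumed, not from the paper).

    CC is a similarity score (higher is better), so it enters with a
    negative sign; MSE is an error (lower is better), so it is added.
    """
    return w_mse * mse(pred, target) - w_cc * pearson_cc(pred, target)


if __name__ == "__main__":
    prediction = [0.1, 0.4, 0.9, 0.2]
    ground_truth = [0.0, 0.5, 1.0, 0.1]
    print(combined_loss(prediction, ground_truth))
```

Under this assumed convention, a prediction identical to the ground truth gives the lowest loss, since the MSE term vanishes and the CC term reaches its maximum of 1.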

Abstract

The analysis and prediction of visual attention have long been crucial tasks in computer vision and image processing. In practical applications, images are generally accompanied by text descriptions; however, few studies have explored the influence of text descriptions on visual attention, let alone developed saliency prediction models that account for text guidance. In this paper, we conduct a comprehensive study of text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding eye-tracking data. Based on the SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models that consider text influence, we construct a benchmark on SJTU-TIS using state-of-the-art saliency models. Finally, because most existing saliency models ignore the effect of text descriptions on visual attention, we propose a text-guided saliency (TGSal) prediction model, which extracts and integrates image features and text features to predict image saliency under various text-description conditions. Our proposed model significantly outperforms state-of-the-art saliency models on both the SJTU-TIS database and pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: https://github.com/IntMeGroup/TGSal.

Paper Structure (37 sections, 14 equations, 13 figures, 17 tables)

This paper contains 37 sections, 14 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: The $1^{\text{st}}$ column: heatmaps of the human gaze on the original image, and the images with three different text descriptions. The $2^{\text{nd}}$ column: corresponding prediction results of our model.
  • Figure 2: Schematic diagram of our eye-tracking experiment. Comparison between pure image condition and text-guided condition is shown.
  • Figure 3: Classification of the attributes of the texts.
  • Figure 4: Examples of the collected different scenes. (a) Indoor scenes. (b) Natural scenes. (c) Urban scenes. (d) Party scenes.
  • Figure 5: An illustration of the example images and the corresponding fixation schematic map under four different text-description conditions given in Fig. 3. Red boxes represent salient objects, blue boxes represent non-salient objects, and yellow boxes represent common descriptions containing both salient and non-salient objects. Red points: text-guided condition. Green points: pure image condition.
  • ...and 8 more figures