GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

Xianyu Chen, Ming Jiang, Qi Zhao

TL;DR

GazeXplain addresses the problem of explainable visual scanpath prediction by jointly predicting fixations and natural language explanations. It introduces a general vision–language encoder with an attention–language decoder, a semantic alignment mechanism, and cross-dataset co-training to enable robust generalization across diverse eye-tracking tasks. The approach is validated on AiR-D, OSIE, and COCO-Search18, showing state-of-the-art performance in both scanpath accuracy and explanation quality, with ablations confirming the contributions of each component. This work advances interpretable models of human visual attention and lays a foundation for integrating explainable gaze reasoning into vision–language systems and cognitive science research.

Abstract

While exploring visual scenes, humans' scanpaths are driven by their underlying attention processes. Understanding visual scanpaths is essential for various applications. Traditional scanpath models predict the where and when of gaze shifts without providing explanations, creating a gap in understanding the rationale behind fixations. To bridge this gap, we introduce GazeXplain, a novel study of visual scanpath prediction and explanation. This involves annotating natural-language explanations for fixations across eye-tracking datasets and proposing a general model with an attention-language decoder that jointly predicts scanpaths and generates explanations. It integrates a unique semantic alignment mechanism to enhance the consistency between fixations and explanations, alongside a cross-dataset co-training approach for generalization. These novelties present a comprehensive and adaptable solution for explainable human visual scanpath prediction. Extensive experiments on diverse eye-tracking datasets demonstrate the effectiveness of GazeXplain in both scanpath prediction and explanation, offering valuable insights into human visual attention and cognitive processes.

Paper Structure (37 sections, 4 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: This example reveals how observers strategically investigate a scene to find out if the person is walking on the sidewalk. Fixations (circles) start centrally, locating a driving car, then shift to the sidewalk to find the person, and finally move down to confirm whether the person is walking. By annotating observers' scanpaths with detailed explanations, we enable a deeper understanding of the "what" and "why" behind fixations, providing insights into human decision-making and task performance.
  • Figure 1: Qualitative examples from GazeXplain compared to Gazeformer and the ground truth on the OSIE dataset. Each row shows scanpaths and explanations of two key fixations.
  • Figure 2: LLaVA generates the ground-truth explanation for each fixation using an input image with a red circle marking the fixation. The model's response provides information within the marked area, serving as a basis for further refinement.
  • Figure 2: Qualitative examples from GazeXplain compared to Gazeformer and the ground truth on the COCO-Search18 dataset. Each row shows scanpaths and explanations of two key fixations.
  • Figure 3: GazeXplain's architecture consists of a general vision-language encoder and a novel attention-language decoder. The decoder outputs an explanation for each fixation in the predicted scanpath, with a semantic alignment mechanism facilitating the semantic consistency between fixations and explanations. The model is developed on three public datasets using a cross-dataset co-training technique.
  • ...and 2 more figures