GEM: Context-Aware Gaze EstiMation with Visual Search Behavior Matching for Chest Radiograph

Shaonan Liu, Wenting Chen, Jie Liu, Xiaoling Luo, Linlin Shen

TL;DR

This paper defines and tackles context-aware gaze estimation in medical radiology by linking radiology text with images through a context-aware alignment module, and by modeling radiologists’ visual search with a visual behavior graph and a graph-matching strategy. The proposed GEM framework integrates multi-scale image–text fusion, high-order gaze point relationships via graph networks, and an AIS-based matching objective to closely reproduce expert gaze patterns. Empirical results on four public chest X-ray datasets show GEM achieving state-of-the-art gaze localization and strong zero-shot generalization, with ablation analyses confirming the contributions of the context-aware module and VBMatch. The work advances interpretability and multi-modal utilization in medical imaging, offering a pathway to more accurate, explainable gaze-based diagnostics and training resources.

Abstract

Gaze estimation is pivotal in human scene comprehension tasks, particularly in medical diagnostic analysis. Eye-tracking technology records physicians' eye movements during image interpretation, revealing their visual attention patterns and information-processing strategies. In this paper, we first define the context-aware gaze estimation problem in the medical radiology report setting. To understand how radiologists allocate attention and reason while interpreting medical images, we propose a context-aware Gaze EstiMation (GEM) network that uses eye gaze data collected from radiologists to simulate their visual search behavior throughout the interpretation process. GEM consists of a context-aware module, visual behavior graph construction, and visual behavior matching. The context-aware module performs fine-grained multimodal alignment by establishing connections between medical reports and images. To simulate genuine visual search behavior more accurately, we introduce a visual behavior graph that captures this behavior through high-order relationships (edges) between gaze points (nodes). To keep the estimated behavior faithful to the real behavior, we devise a visual behavior matching approach that adjusts the high-order relationships between estimated gaze points by matching the graph built from estimated gaze points against the graph built from real gaze points. Extensive experiments on four publicly available datasets demonstrate that GEM outperforms existing methods and generalizes well. The approach also suggests a new direction for using diverse modalities in medical image interpretation and improves the interpretability of models in medical imaging. Code: https://github.com/Tiger-SN/GEM
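
To make the graph idea in the abstract concrete, the sketch below builds a graph over gaze points (nodes) with pairwise affinities (edges) and compares the graph from real gaze points with the graph from estimated ones. It is an illustration only, not the authors' implementation: the Gaussian affinity, the `sigma` value, the mean-squared comparison, the assumption that both point sets are in one-to-one order, and the function names `behavior_graph` and `graph_matching_loss` are all hypothetical choices made here. The abstract does not specify how edges or the matching objective are defined.

```python
import numpy as np


def behavior_graph(points, sigma=0.25):
    """Build a dense affinity matrix over gaze points.

    Nodes are gaze points (x, y) in normalized image coordinates.
    Edge weights use a Gaussian kernel on pairwise distance; this edge
    definition is an assumption for illustration, not taken from the paper.
    """
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]      # shape (N, N, 2)
    dist2 = (diff ** 2).sum(axis=-1)              # squared pairwise distances
    adj = np.exp(-dist2 / (2.0 * sigma ** 2))     # affinities in (0, 1]
    np.fill_diagonal(adj, 0.0)                    # no self-loops
    return adj


def graph_matching_loss(real_points, est_points, sigma=0.25):
    """Mean squared difference between the two affinity matrices.

    Assumes both point sets have the same length and are listed in
    corresponding order (a simplification made for this sketch).
    """
    a_real = behavior_graph(real_points, sigma)
    a_est = behavior_graph(est_points, sigma)
    return float(np.mean((a_real - a_est) ** 2))


if __name__ == "__main__":
    real = [[0.2, 0.3], [0.5, 0.5], [0.7, 0.4]]
    estimated = [[0.25, 0.3], [0.5, 0.6], [0.6, 0.4]]
    print(graph_matching_loss(real, estimated))
```

Because this toy loss compares only relationships between points, it does not change when every estimated point is shifted by the same offset. In a sketch like this one, it would therefore need to be combined with a point-wise localization term.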

Paper Structure

This paper contains 14 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the proposed Context-Aware Gaze EstiMation (GEM) network. It consists of a context-aware module for fine-grained inter-modal alignment, a visual behavior graph construction to capture radiologists' visual search behavior, and a visual behavior matching module to preserve the behavior.
  • Figure 2: Qualitative comparison of GEM and other models on the MIMIC-Eye dataset. Yellow arrows indicate regions relevant to the input text; red and blue points represent the ground-truth (GT) and estimated gaze points, respectively.
  • Figure 3: Visualization of the ablation study on the MIMIC-Eye dataset.
  • Figure 4: Qualitative evaluation of our method on the easy task of phrase grounding on the MS-CXR dataset. Yellow boxes indicate radiologists' annotations, blue points are estimated gaze points, and the heatmap and box show focal areas highlighted by BioViL (Boecking et al., 2022).
  • Figure 5: Qualitative evaluation of our method on the hard task of gaze estimation on the OpenI and AIforCOVID datasets. Yellow arrows indicate radiologists' annotations, and blue points represent estimated gaze points.