GazeSearch: Radiology Findings Search Benchmark

Trong Thang Pham, Tien-Phat Nguyen, Yuki Ikebe, Akash Awasthi, Zhigang Deng, Carol C. Wu, Hien Nguyen, Ngan Le

TL;DR

This work tackles the misalignment between radiologists' gaze and radiology findings by introducing GazeSearch, a dataset that converts free-view eye-tracking data into finding-aware visual search sequences for chest X-rays. It then proposes ChestSearch, a transformer-based scanpath predictor pretrained with self-supervised radiology features and guided by a query mechanism to predict subsequent fixations, durations, and termination. The authors demonstrate that GazeSearch enables meaningful modeling of medical visual search and that ChestSearch achieves state-of-the-art alignment with radiologist-like gaze across multiple metrics, offering a solid benchmark for future medical visual search research. Overall, the approach enhances interpretability and trust in AI-assisted radiology by aligning AI attention with expert human gaze and providing a robust evaluation framework.

Abstract

Medical eye-tracking data is an important information source for understanding how radiologists visually interpret medical images. This information not only improves the accuracy of deep learning models for X-ray analysis but also their interpretability, enhancing transparency in decision-making. However, the current eye-tracking data is dispersed, unprocessed, and ambiguous, making it difficult to derive meaningful insights. Therefore, there is a need to create a new dataset with more focused and purposeful eye-tracking data, improving its utility for diagnostic applications. In this work, we propose a refinement method inspired by the target-present visual search challenge: there is a specific finding and fixations are guided to locate it. After refining the existing eye-tracking datasets, we transform them into a curated visual search dataset, called GazeSearch, specifically for radiology findings, where each fixation sequence is purposefully aligned to the task of locating a particular finding. Subsequently, we introduce a scanpath prediction baseline, called ChestSearch, specifically tailored to GazeSearch. Finally, we employ the newly introduced GazeSearch as a benchmark to evaluate the performance of current state-of-the-art methods, offering a comprehensive assessment for visual search in the medical imaging domain. Code is available at https://github.com/UARK-AICV/GazeSearch.

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: (a) Given a CXR image, we are interested in a radiologist's eye movements when searching for a finding. (b) However, existing eye gaze datasets are recorded in free-view form, where fixations are distributed across the entire CXR image, making it unclear which fixations correspond to specific findings. (c) Our new GazeSearch dataset, where each fixation sequence is focused on a specific finding. For example, the gaze sequence in (c.1) targets lung opacity, while (c.2) focuses on pneumonia. Each circle depicts a fixation, with the number and radius indicating its order and duration, respectively.
  • Figure 2: Pipeline of GazeSearch creation, which processes free-view eye gaze data as input and outputs a finding-aware scanpath.
  • Figure 3: Detailed view of ChestSearch. The previous fixations, denoted as $\{(x_i, y_i, d_i)\}_{i=1}^{t-1}$, and the input chest X-ray image $I$ are passed through a Feature Extractor and Spatiotemporal Embedding to generate the spatiotemporal embedded feature $E$. Next, the Fixation Decoder uses a learnable query $q_c$ and the embedded feature $E$ to produce a decoded feature $\bar{E}$. From here, three heads use $\bar{E}$ to predict the next fixation's coordinates and duration $(\hat{x}_t, \hat{y}_t, \hat{d}_t)$ and whether to terminate. In the example shown, at step $t$, the termination head outputs "Yes," indicating that this is the final fixation for the image $I$.
  • Figure 4: Qualitative comparison of our ChestSearch with ChenLSTM-ISP, Gazeformer, Gazeformer-ISP, and HAT. Four findings (rows) are shown from top to bottom: Atelectasis, Cardiomegaly, Edema, and Lung lesion. Each circle represents a fixation, with the number and radius indicating its order and duration, respectively. As HAT only predicts 2D coordinates, all of its predicted fixations are drawn with the same radius.
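
The step-by-step decoding described in the Figure 3 caption (previous fixations in, next fixation and a termination decision out) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Fixation`, `StepFn`, `rollout`, and `toy_step`, the `max_steps` cap, and the duration unit are all assumptions made for illustration, and `toy_step` is a trivial stand-in for the actual model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Fixation:
    """One fixation (x, y, d), mirroring the triple in the Figure 3 caption."""
    x: float  # horizontal position in the image
    y: float  # vertical position in the image
    d: float  # duration; unit is not given in the source, seconds assumed here


# Hypothetical interface: (image, previous fixations) -> (next fixation, stop flag).
StepFn = Callable[[object, List[Fixation]], Tuple[Fixation, bool]]


def rollout(image: object, step: StepFn, max_steps: int = 10) -> List[Fixation]:
    """Generate a scanpath one fixation at a time until the stop flag is set.

    max_steps is an assumed safety cap so the loop always ends.
    """
    scanpath: List[Fixation] = []
    for _ in range(max_steps):
        fixation, stop = step(image, scanpath)
        scanpath.append(fixation)
        if stop:  # termination output "Yes": this fixation is the last one
            break
    return scanpath


def toy_step(image: object, previous: List[Fixation]) -> Tuple[Fixation, bool]:
    """Stand-in for a trained predictor: moves diagonally, stops at step 3."""
    t = len(previous) + 1
    return Fixation(x=10.0 * t, y=5.0 * t, d=0.2), t == 3


if __name__ == "__main__":
    for i, f in enumerate(rollout(image=None, step=toy_step), start=1):
        print(i, f)
```

In this sketch the scanpath grows by one fixation per call, so a real predictor would see the full history $\{(x_i, y_i, d_i)\}_{i=1}^{t-1}$ at every step, matching the input described in the caption.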