
A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization using Eye-tracking Data

Elham Ghelichkhan, Tolga Tasdizen

TL;DR

The paper addresses chest X-ray abnormality localization by comparing object detection (OD) and phrase grounding (PG), using an automatic eye-tracking (ET)–driven pipeline to create explainability baselines. It repurposes REFLACX/MIMIC-CXR data to train and evaluate both approaches, showing that text-guided phrase grounding ($\text{mIoU}=0.36$) outperforms object detection ($\text{mIoU}=0.20$) and yields higher explainability (containment ratio, CR, of 0.48 vs. 0.26). An ET-based bounding-box generation process demonstrates that radiologists’ gaze regions align with abnormalities and can be learned by models, with PG achieving superior coverage of relevant regions. The work provides a scalable framework for integrating eye-tracking data into local VLMs and suggests future enhancements such as multiple boxes per statement and deeper integration of ET signals to boost both accuracy and interpretability in clinical localization tasks.

Abstract

Chest diseases rank among the most prevalent and dangerous global health issues. Object detection and phrase grounding deep learning models interpret complex radiology data to assist healthcare professionals in diagnosis. Object detection locates abnormalities for predefined classes, while phrase grounding locates abnormalities for textual descriptions. This paper investigates how text enhances abnormality localization in chest X-rays by comparing the performance and explainability of these two tasks. To establish an explainability baseline, we propose an automatic pipeline that generates image regions for report sentences using radiologists' eye-tracking data. The better performance (mIoU = 0.36 vs. 0.20) and explainability (containment ratio = 0.48 vs. 0.26) of the phrase grounding model indicate the effectiveness of text in enhancing chest X-ray abnormality localization.
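
The two metrics quoted above can be illustrated with a small sketch. The IoU and mean-IoU functions follow the standard definition for axis-aligned boxes given as `(x0, y0, x1, y1)`. The `containment_ratio` function is only an assumption: this page does not define the metric, so the code uses one plausible reading (the share of a box's area that lies inside a reference region). All function names are hypothetical and not taken from the paper.

```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); 0 if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a, b):
    """Area of the overlap between boxes a and b."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x0, y0, x1, y1))


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted, ground_truth):
    """Average IoU over paired predicted and ground-truth boxes."""
    scores = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores) if scores else 0.0


def containment_ratio(box, region):
    """ASSUMED definition: fraction of `box` area lying inside `region`.

    The paper's exact formula is not given on this page.
    """
    area = box_area(box)
    return intersection_area(box, region) / area if area > 0 else 0.0
```

Under this assumed definition, containment is asymmetric: swapping `box` and `region` generally changes the value, whereas IoU is symmetric in its two arguments.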

Paper Structure

This paper contains 12 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Automatically generated bounding boxes (BBs) for report sentences.
  • Figure 2: Automatic pipeline of eye-tracking (ET) bounding box generation. (a) Starting from the first sentence, we collected fixations from PSI seconds before the sentence starts until the sentence ends. (b) We summed all fixation heatmaps of a sentence into one heatmap, (c) normalized and thresholded it, and removed small regions. (d) The BB enclosing the filtered heatmap is the extracted ET bounding box for the sentence.
  • Figure 3: Explainability assessment by containment ratio. The blue-filled, purple-filled, and blue-outlined BBs show, respectively, the abnormality annotation labeled "Pleural abnormality", the ET BB, and the PG-predicted BB for the statement "biapical pleural thickening versus pleural fluid". The BBs are overlaid on the ET fixation heatmap corresponding to the statement (left).
  • Figure 4: Phrase grounding (PG, right column) outperforms object detection (OD, middle column). The text boxes show the color-coded ground truth (GT), PG statements, and OD false predictions. Transparent-filled and outlined BBs show the color-coded GT and predicted BBs, respectively.
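
The four steps in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The fixation format (dicts with `t`, `x`, `y`, `duration`), the Gaussian heatmap model, the duration weighting, and the parameters `lead_seconds` (standing in for the caption's PSI), `sigma`, `threshold`, and `min_area` are all assumptions introduced here.

```python
from collections import deque

import numpy as np


def fixation_heatmap(shape, x, y, sigma, weight=1.0):
    """Gaussian centred on one fixation (x, y), scaled by weight (assumed model)."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    return weight * np.exp(-((cols - x) ** 2 + (rows - y) ** 2) / (2.0 * sigma ** 2))


def connected_components(mask):
    """4-connected components of a boolean mask, each as a list of (row, col)."""
    seen = np.zeros_like(mask, dtype=bool)
    components = []
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        seen[r, c] = True
        queue, pixels = deque([(r, c)]), []
        while queue:
            pr, pc = queue.popleft()
            pixels.append((pr, pc))
            for nr, nc in ((pr + 1, pc), (pr - 1, pc), (pr, pc + 1), (pr, pc - 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        components.append(pixels)
    return components


def sentence_bounding_box(fixations, sentence_start, sentence_end, shape,
                          lead_seconds=1.0, sigma=10.0, threshold=0.5, min_area=20):
    """Return (x0, y0, x1, y1) in pixel indices for one sentence, or None."""
    # (a) Keep fixations from lead_seconds before the sentence until it ends.
    window_start = sentence_start - lead_seconds
    selected = [f for f in fixations if window_start <= f["t"] <= sentence_end]
    if not selected:
        return None
    # (b) Sum the per-fixation heatmaps into one heatmap.
    heat = sum(fixation_heatmap(shape, f["x"], f["y"], sigma, f["duration"])
               for f in selected)
    # (c) Normalize to [0, 1], threshold, and drop small regions.
    mask = (heat / heat.max()) >= threshold
    kept = [p for comp in connected_components(mask) if len(comp) >= min_area
            for p in comp]
    if not kept:
        return None
    # (d) The box enclosing the surviving pixels is the ET bounding box.
    rows, cols = zip(*kept)
    return (min(cols), min(rows), max(cols), max(rows))
```

The connected-component step is written out by hand to keep the sketch dependency-light; a library routine for labelling connected regions would do the same job. The returned coordinates are inclusive pixel indices.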