A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization using Eye-tracking Data
Elham Ghelichkhan, Tolga Tasdizen
TL;DR
The paper addresses chest X-ray abnormality localization by comparing object detection (OD) and phrase grounding (PG), using an automatic eye-tracking–driven pipeline to create explainability baselines. It repurposes REFLACX/MIMIC-CXR data to train and evaluate both approaches, showing that text-guided phrase grounding ($\text{mIoU}=0.36$) outperforms object detection ($\text{mIoU}=0.20$) and yields higher explainability (containment ratio $\text{CR}=0.48$ vs. $0.26$). An eye-tracking (ET) based bounding-box generation process demonstrates that radiologists' gaze regions align with abnormalities and can be learned by models, with PG achieving better coverage of the relevant regions. The work provides a scalable framework for integrating eye-tracking data into local VLMs and suggests future enhancements, such as multiple boxes per statement and deeper integration of ET signals, to improve both accuracy and interpretability in clinical localization tasks.
Abstract
Chest diseases rank among the most prevalent and dangerous global health issues. Object detection and phrase grounding deep learning models interpret complex radiology data to assist healthcare professionals in diagnosis. Object detection localizes abnormalities given class labels, while phrase grounding localizes abnormalities given textual descriptions. This paper investigates how text enhances abnormality localization in chest X-rays by comparing the performance and explainability of these two tasks. To establish an explainability baseline, we propose an automatic pipeline that generates image regions for report sentences from radiologists' eye-tracking data. The phrase grounding model achieves better performance (mIoU = 0.36 vs. 0.20) and better explainability (containment ratio = 0.48 vs. 0.26) than the object detection model, which indicates that text is effective in enhancing chest X-ray abnormality localization.
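To make the two reported metrics concrete, the following is a minimal Python sketch of how they could be computed for axis-aligned boxes. The IoU is the standard definition. The containment ratio here is an assumption: the abstract does not define it, so this sketch takes it to be the fraction of the eye-tracking-derived box that falls inside the predicted box. All function names and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical and chosen for illustration only.

```python
from typing import Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (zero if they are disjoint)."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(pred: Box, gt: Box) -> float:
    """Standard intersection over union of a predicted and a ground-truth box."""
    inter = intersection_area(pred, gt)
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def containment_ratio(pred: Box, et_box: Box) -> float:
    """ASSUMED definition, not taken from the paper: the fraction of the
    eye-tracking-derived box area that is covered by the predicted box."""
    et_area = area(et_box)
    return intersection_area(pred, et_box) / et_area if et_area > 0 else 0.0
```

Under this assumed definition, a prediction can have a modest IoU with the annotated box and still contain most of the region a radiologist looked at, which is why the two numbers are reported separately.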
