Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation

Yunsoo Kim; Jinge Wu; Su-Hwan Kim; Pardeep Vasudev; Jiashu Shen; Honghan Wu

TL;DR

The paper introduces Look & Mark (L&M), a prompt-based grounding strategy that combines radiologist eye fixations (Look) with bounding box annotations (Mark) to guide chest X-ray report generation by multimodal LLMs without retraining. Across both domain-specific and general-purpose models, L&M improves clinical-relevance metrics (e.g., RadGraph-XL, RaTEScore), and expert radiologists confirm that it reduces clinically significant errors. General-purpose models benefit further when L&M is combined with in-context learning: LLaVA-OV with I&L&M reaches 87.3% C.AVG, the highest among the evaluated models. These results position L&M as a scalable, data-efficient route to more reliable AI-assisted radiology in low-resource settings; future work aims to extend it to other modalities and multi-view scenarios.

Abstract

Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. Evaluated across multiple domain-specific and general-purpose models, L&M yields a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA over baseline prompting and a 9.2% improvement for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning: LLaVA-OV achieves an 87.3% clinical average performance (C.AVG), the highest among all models, surpassing even those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors, such as false predictions and omissions, by an average of 0.43 errors per report, enhancing both accuracy and reliability. These findings highlight L&M's potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.
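To make the idea of prompt-level grounding concrete, the following is a minimal sketch of how fixation and bounding box information might be serialized into a text prompt. It is an illustration only, not the authors' implementation: the `Region` type, the `build_prompt` function, the field names, the prompt wording, and the choice to rank regions by fixation duration are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical container for one annotated image region."""
    label: str         # anatomical label attached to the bounding box
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels
    fixation_s: float  # total radiologist gaze time on the region, in seconds


def build_prompt(regions, examples=()):
    """Assemble a grounded text prompt (assumed format, for illustration).

    `examples` is an optional sequence of (regions_text, report_text) pairs
    used as in-context demonstrations.
    """
    # Assumption: regions viewed longest are listed first.
    ranked = sorted(regions, key=lambda r: r.fixation_s, reverse=True)

    lines = ["You are assisting with a chest X-ray report."]
    for regions_text, report_text in examples:
        lines.append(f"Example regions: {regions_text}")
        lines.append(f"Example report: {report_text}")

    lines.append("Regions the radiologist examined (longest fixation first):")
    for r in ranked:
        x0, y0, x1, y1 = r.box
        lines.append(
            f"- {r.label}: box=({x0}, {y0}, {x1}, {y1}), "
            f"fixation={r.fixation_s:.1f}s"
        )
    lines.append("Write the findings section, grounded in these regions.")
    return "\n".join(lines)


prompt = build_prompt(
    [
        Region("left lower lung", (10, 20, 30, 40), 1.5),
        Region("cardiac silhouette", (50, 60, 70, 80), 3.0),
    ]
)
print(prompt)
```

The resulting string would be passed to a multimodal model alongside the image. Because the grounding lives entirely in the prompt, no model weights change, which matches the retraining-free framing of the abstract.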
