Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation

Yunsoo Kim, Jinge Wu, Su-Hwan Kim, Pardeep Vasudev, Jiashu Shen, Honghan Wu

TL;DR

The paper introduces Look & Mark (L&M), a prompt-based grounding strategy that combines radiologist eye fixations (Look) and bounding box annotations (Mark) to guide chest X-ray report generation by multimodal LLMs without retraining. Across both domain-specific and general-purpose models, L&M consistently improves clinical-relevance metrics (e.g., RadGraph-XL, RaTEScore) and reduces clinically significant errors, a reduction confirmed by expert radiologist evaluation. General-purpose models also benefit when L&M is combined with in-context learning: LLaVA-OV with I&L&M reaches an 87.3% clinical average (C.AVG), the highest among all evaluated models. These results position L&M as a scalable, data-efficient route to more reliable AI-assisted radiology in low-resource settings; future work aims to extend the approach to other modalities and multi-view scenarios.

Abstract

Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. When evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA compared to baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models, even surpassing those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors (by 0.43 average errors per report), such as false predictions and omissions, enhancing both accuracy and reliability. These findings highlight L&M's potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.

Paper Structure

This paper contains 22 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the Look & Mark framework. The input to the model consists of a chest X-ray image augmented with two forms of expert-derived visual grounding: (1) Bounding boxes highlighting abnormal findings (Mark), and (2) Radiologist eye fixations, converted into fixation heatmaps and summarized in the prompt as text (Look). The bounding boxes are overlaid on the image as part of the visual input, while the fixation durations mapped to abnormalities are embedded in the textual prompt. These dual cues are used to construct an in-context prompt for a multimodal LLM, enabling it to generate clinically relevant, grounded radiology reports without model retraining. To comply with the MIMIC-CXR data usage license, we use a substitute image from Wikimedia that reflects a comparable diagnosis, and paraphrase the associated report text.
  • Figure 2: Performance increase/decrease in A.AVG of L&M compared to L and M for domain-specific models.
  • Figure 3: Performance increase/decrease in A.AVG of I&L&M compared to I&L and I&M for general-purpose models.
  • Figure 4: Expert analysis of model outputs. Red text marks the clinically significant errors identified by the radiologist.
  • Figure 5: Heatmap of normalized scores across general-purpose models to compare in-context learning and our method.
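
The Figure 1 caption describes the "Look" cue as fixation durations mapped to abnormalities and embedded in the textual prompt. The following is a minimal sketch of that idea, not the authors' implementation: it sums fixation time inside each bounding box and renders the totals as prompt text. The `Box` and `Fixation` types, the function names, the pixel-coordinate convention, and the prompt wording are all hypothetical, chosen only for illustration. Drawing the boxes onto the image (the "Mark" cue) is left out.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box around an abnormal finding (pixel coordinates)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Fixation:
    """Hypothetical eye-fixation sample: gaze position and dwell time in seconds."""
    x: float
    y: float
    duration_s: float


def fixation_time_per_box(boxes, fixations):
    """Sum the fixation durations that fall inside each box, keyed by label.

    Boxes sharing a label are merged; fixations outside every box are ignored.
    """
    totals = {box.label: 0.0 for box in boxes}
    for fix in fixations:
        for box in boxes:
            if box.contains(fix.x, fix.y):
                totals[box.label] += fix.duration_s
    return totals


def build_look_prompt(boxes, fixations):
    """Render per-finding fixation time as text for a multimodal LLM prompt.

    The wording below is an assumption, not the prompt used in the paper.
    """
    totals = fixation_time_per_box(boxes, fixations)
    # List the longest-viewed findings first.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    lines = [f"- {label}: {seconds:.1f} s" for label, seconds in ranked]
    return (
        "The image has bounding boxes drawn around abnormal findings.\n"
        "Radiologist fixation time per marked region:\n"
        + "\n".join(lines)
        + "\nWrite the findings section of the radiology report."
    )
```

Calling `build_look_prompt(boxes, fixations)` with a list of boxes and fixation samples returns a string that can be sent alongside the annotated image. Ranking regions by dwell time is a design choice of this sketch; the source does not say how the durations are ordered in the prompt.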