Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, Soujanya Samineni, Renato Umeton, Alexander Chowdhury, William Lotter

TL;DR

This paper evaluates the localization abilities of frontier multimodal LLMs (GPT-4, GPT-5) and a domain-specific model (MedGemma) on chest radiographs using the CheXlocalize dataset. A grid-based prompting pipeline overlays an $8\times8$ grid and asks models to predict the most representative cell for each present pathology, with performance benchmarked against radiologists and a CNN baseline using a hit-rate metric defined by a $50\%$ overlap threshold. Results show GPT-5 achieves the best localization among LLMs (≈$49.7\%$ hit rate) but remains below both the CNN baseline (≈$59.9\%$) and radiologist performance (≈$80.1\%$); MedGemma performs weakest overall, though few-shot prompting yields improvements. The study highlights the potential of large general-purpose models for medical localization tasks while underscoring limitations in fine-grained spatial reasoning and the value of combining LLMs with task-specific tools for reliable clinical use. Overall, the proposed evaluation framework provides a scalable approach to probe spatial understanding in foundation models and informs directions for future integration in clinical workflows.

Abstract

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite this modest performance, error analysis revealed that GPT-5's predictions fell largely in anatomically plausible regions, though they were not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations but struggled with spatially variable findings and made anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies but improved when provided with examples through few-shot prompting. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
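
To make the grid-based evaluation concrete, the following is a minimal sketch of how a hit rate of this kind could be computed. It is not the authors' implementation. It assumes a binary ground-truth mask given as a list of pixel rows, the $8\times8$ grid from the TL;DR, and that "overlap" means the fraction of a cell's area covered by the mask; the function names `ground_truth_cells` and `hit_rate` are hypothetical. How masks with no cell reaching the threshold are handled is not specified here, so the sketch simply returns an empty set for them.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, col) index into the grid


def ground_truth_cells(mask: List[List[int]], grid: int = 8,
                       threshold: float = 0.5) -> Set[Cell]:
    """Return the grid cells whose area is covered >= threshold by the mask.

    Assumption: overlap is measured as covered pixels / cell pixels.
    """
    height, width = len(mask), len(mask[0])
    cells: Set[Cell] = set()
    for r in range(grid):
        # Floor-division boundaries; cell sizes may differ by one pixel.
        top, bottom = r * height // grid, (r + 1) * height // grid
        for c in range(grid):
            left, right = c * width // grid, (c + 1) * width // grid
            area = (bottom - top) * (right - left)
            if area == 0:
                continue  # image smaller than the grid along some axis
            covered = sum(mask[y][x]
                          for y in range(top, bottom)
                          for x in range(left, right))
            if covered / area >= threshold:
                cells.add((r, c))
    return cells


def hit_rate(predictions: List[Cell], truths: List[Set[Cell]]) -> float:
    """Fraction of predictions that land in a ground-truth cell."""
    if not predictions or len(predictions) != len(truths):
        raise ValueError("need one ground-truth cell set per prediction")
    hits = sum(pred in truth for pred, truth in zip(predictions, truths))
    return hits / len(predictions)
```

Under these assumptions, a model's single predicted cell for a radiograph counts as a hit when it belongs to the set returned by `ground_truth_cells` for that radiograph and pathology, and `hit_rate` averages those outcomes over all evaluated cases.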

Paper Structure

This paper contains 17 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Study Overview. a) The nine pathologies in the CheXlocalize dataset were utilized. b) For each pathology class, a gridded image was produced for each radiograph where the pathology was present. c) Three multimodal LLMs were prompted to identify the grid cell where the stated pathology is most prominent. d) The predicted grid cells were compared to the ground truth cells, defined as cells with $\geq$50% overlap with the ground truth mask.
  • Figure 2: Hit rate by pathology. Pathologies are ordered by the relative performance difference between GPT-5 and the CNN baseline. Error bars show the standard deviation estimated via bootstrapping.
  • Figure 3: Error categorization. Each prediction on a frontal radiograph was categorized as a full hit ($\geq$50% overlap), partial hit (0 $<$ overlap $<$ 50%), position error (no overlap but plausible anatomy), or anatomy error (implausible anatomy).
  • Figure 4: Example ground truth and prediction heatmaps. The heatmaps are computed over the frontal radiographs in the test set and are overlaid on the average image.
  • Figure 5: Example anatomy errors. a) A consolidation example where GPT-5 is correct but GPT-4's prediction overlays the heart/mediastinum. b) A pneumothorax example where both GPT-4 and GPT-5 predictions overlay the shoulder instead of the lungs.
  • ...and 6 more figures
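
The error categories in the Figure 3 caption and the bootstrapped error bars in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the captions, not the authors' code. It assumes that the overlap of the predicted cell with the ground truth mask is already available as a fraction in [0, 1] and that anatomical plausibility is supplied as a boolean judged elsewhere; the names `categorize` and `bootstrap_std`, the number of resamples, and the seed are hypothetical choices.

```python
import random
import statistics
from typing import List


def categorize(overlap: float, plausible_anatomy: bool) -> str:
    """Map one prediction to a category from the Figure 3 caption."""
    if overlap >= 0.5:
        return "full hit"
    if overlap > 0.0:
        return "partial hit"
    # No overlap: distinguish by whether the cell lies in plausible anatomy.
    return "position error" if plausible_anatomy else "anatomy error"


def bootstrap_std(hits: List[bool], n_resamples: int = 1000,
                  seed: int = 0) -> float:
    """Standard deviation of the hit rate over bootstrap resamples.

    Each resample draws len(hits) outcomes with replacement.
    """
    if not hits:
        raise ValueError("hits must not be empty")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(hits)
    rates = []
    for _ in range(n_resamples):
        sample = rng.choices(hits, k=n)
        rates.append(sum(sample) / n)
    return statistics.pstdev(rates)
```

Here `hits` would hold one boolean per evaluated radiograph for a given model and pathology, so the returned value plays the role of an error bar on that pathology's hit rate.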