MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Mohammad Shahab Sepehri; Zalan Fabian; Maryam Soltanolkotabi; Mahdi Soltanolkotabi

TL;DR

Medical MLLMs face safety-critical reliability challenges in radiology. The authors introduce MediConfusion, a vision-focused VQA benchmark built from confusing image pairs drawn from ROCO and curated with radiologist input, designed to test whether models actually use the image rather than rely on unimodal priors. Across 13 models, most systems perform no better than random and exhibit high confusion, with best-case performance around 61.9% in certain categories, indicating substantial reliability gaps. The authors analyze common failure modes and suggest that better visual encoders and the ability to read text and visual prompts embedded in images (OCR) may be necessary to achieve trustworthy medical AI.

Abstract

Multimodal Large Language Models (MLLMs) have tremendous potential to improve the accuracy, availability, and cost-effectiveness of healthcare by providing automated solutions or serving as aids to medical professionals. Despite promising first steps in developing medical MLLMs in the past few years, their capabilities and limitations are not well understood. Recently, many benchmark datasets have been proposed that test the general medical knowledge of such models across a variety of medical areas. However, the systematic failure modes and vulnerabilities of such models are severely underexplored, with most medical benchmarks failing to expose the shortcomings of existing models in this safety-critical domain. In this paper, we introduce MediConfusion, a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical MLLMs from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct for medical experts. Strikingly, all available models (open-source or proprietary) achieve performance below random guessing on MediConfusion, raising serious concerns about the reliability of existing medical MLLMs for healthcare deployment. We also extract common patterns of model failure that may help in designing a new generation of more trustworthy and reliable MLLMs in healthcare.

Paper Structure (31 sections, 9 figures, 10 tables)

Introduction
The MediConfusion Benchmark
Discovering confusing pairs
VQA generation
Data filtering and revision via radiologist feedback
Experiments
Evaluation
Results
Discussion
Identifying Patterns in Confusing Pairs
Visual prompts in MediConfusion
Related Work
Conclusion
Prompts for dataset curation
Question generation
...and 16 more sections

Figures (9)

Figure 1: Overview of the MediConfusion curation pipeline. First, we extract image pairs from the ROCO radiology dataset that are clearly distinct in the image domain but may be challenging for multimodal models to tell apart (left). Next, we use an automated pipeline leveraging LLM prompting to generate VQA problems from the confusing pairs and their corresponding captions (center). Finally, we incorporate radiologist feedback to filter questions for correctness, relevance, and quality, and to revise the questions and answer options for improved medical language and precision (right).
Figure 2: Sample confusing image pairs extracted from the ROCO dataset across 9 categories.
Figure 3: A VQA pair from MediConfusion. A confusing pair shares the same question and answer options, but the correct answer is different for the two (A for the image on the left and B for the image on the right). The model receives a set score only if it correctly answers both questions in the confusing pair. Individual score is evaluated separately for each image.
Figure 4: Distribution of question categories in MediConfusion. We assign a category to the question based on the category of the corresponding image in the VQA. A single image can belong to multiple categories at the same time.
Figure 5: Sample VQA from MediConfusion where the solution is directly provided in the image in the form of text and visual prompts (arrows). Medical MLLMs not trained for OCR have been unable to leverage the hint.
...and 4 more figures
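
The two scores described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `PairResult` record, its field names, and the function names are hypothetical, and the sketch assumes each confusing pair has exactly two answer options (A and B) with a different correct option for each image, as the caption states.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PairResult:
    """Answers for one confusing pair (hypothetical record layout)."""
    pred_first: str   # model's answer for the first image
    pred_second: str  # model's answer for the second image
    gold_first: str   # correct option for the first image
    gold_second: str  # correct option for the second image


def individual_score(results: Sequence[PairResult]) -> float:
    """Fraction of single images answered correctly."""
    if not results:
        raise ValueError("no results to score")
    correct = sum(
        (r.pred_first == r.gold_first) + (r.pred_second == r.gold_second)
        for r in results
    )
    return correct / (2 * len(results))


def set_score(results: Sequence[PairResult]) -> float:
    """Fraction of pairs where BOTH images are answered correctly."""
    if not results:
        raise ValueError("no results to score")
    both = sum(
        r.pred_first == r.gold_first and r.pred_second == r.gold_second
        for r in results
    )
    return both / len(results)


# Toy data: four pairs, each with gold answers A (first) and B (second).
results = [
    PairResult("A", "B", "A", "B"),  # both correct
    PairResult("A", "A", "A", "B"),  # same answer twice: one correct
    PairResult("B", "A", "A", "B"),  # both wrong
    PairResult("B", "B", "A", "B"),  # same answer twice: one correct
]
always_a = [PairResult("A", "A", "A", "B") for _ in range(4)]

print(individual_score(results), set_score(results))
print(individual_score(always_a), set_score(always_a))
```

Under these assumptions, a model that guesses uniformly and independently between the two options is expected to reach an individual score of 50% but a set score of only 25%. A model that gives the same answer for both images of every pair, as in `always_a`, reaches an individual score of 50% and a set score of 0, which is why the set score is the stricter test of whether the model actually distinguishes the two images.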
