CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu

TL;DR

The experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.

Abstract

Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same-size (10B) state-of-the-art (SOTA) model. With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
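
The staged decomposition described in the abstract (entity proposal, then referring segmentation, then evidence-grounded reasoning) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `care_flow`, `Evidence`, and the three callables are hypothetical, the models are replaced by toy stand-ins, and a binary mask is represented as a plain list of rows. The sketch covers only the fixed, coordinator-free flow and omits the coordinator's planning and answer review.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Mask = List[List[int]]  # toy stand-in for a pixel-level binary ROI mask


@dataclass
class Evidence:
    """One proposed entity together with its segmentation mask."""
    entity: str
    mask: Mask


def care_flow(
    image: object,
    question: str,
    propose_entities: Callable[[object, str], List[str]],
    segment: Callable[[object, str], Mask],
    grounded_vqa: Callable[[object, str, List[Evidence]], str],
) -> Tuple[str, List[Evidence]]:
    """Hypothetical fixed pipeline: propose entities, segment them, then reason."""
    # Stage 1: a question-conditioned model names the relevant entities.
    entities = propose_entities(image, question)
    # Stage 2: a referring segmentation model localizes each entity.
    evidence = [Evidence(name, segment(image, name)) for name in entities]
    # Stage 3: the VQA model answers using the image plus the ROI evidence.
    answer = grounded_vqa(image, question, evidence)
    # Returning the evidence keeps the answer traceable to its ROIs.
    return answer, evidence


# Toy stand-ins for the three models (assumptions, for illustration only).
def toy_propose(image: object, question: str) -> List[str]:
    return ["liver"]


def toy_segment(image: object, entity: str) -> Mask:
    return [[0, 1], [0, 1]]


def toy_vqa(image: object, question: str, evidence: List[Evidence]) -> str:
    names = ", ".join(ev.entity for ev in evidence)
    return f"answer grounded on: {names}"


answer, evidence = care_flow(
    "ct.png", "Is the liver enlarged?", toy_propose, toy_segment, toy_vqa
)
print(answer)
```

Passing the three stages in as callables mirrors the decoupling the abstract emphasizes: each specialist can be swapped or retrained without touching the others, and the returned evidence list makes every answer inspectable.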

Paper Structure

This paper contains 36 sections, 7 equations, 24 figures, and 18 tables.

Figures (24)

  • Figure 1: VLMs for medical reasoning. (a) Single-shot VLMs often miss local evidence. (b) Grounding VLMs do not explicitly utilize ROI in reasoning. (c) Generalist visual reasoning VLMs fail with incorrect initial focus. (d) Our agentic CARE-Coord performs grounded evidence-based reasoning and expert discussion, improving accountability. (e) Comparison of average medical VQA accuracy vs. model size. Models with unknown size appear in the rightmost panel.
  • Figure 2: Method overview. The proposed CARE comprises a VLM coordinator and a set of task-specific expert models. The coordinator plans tool use and conducts answer review, invoking specialist models as needed. The expert set includes: (1) a question-conditioned entity-proposal VLM that identifies relevant anatomical structures/findings; (2) a referring segmentation model that localizes entities with pixel-level ROI evidence; and (3) an evidence-grounded VQA VLM that reasons over the image augmented with selected visual evidence (zoom-in, mask, or global indicator).
  • Figure 3: Case Study. We present the complete reasoning trace for a CT disease identification question. Key information from the coordinator is highlighted in blue, model reasoning in green, and each model’s final answer in yellow.
  • Figure 4: Prompt for Data Synthesis. We present the prompt given to GPT-4o to synthesize training data for the entity proposal model. The model is asked to generate questions from the meta-information of the provided image; each question relates to the medical entity or entities listed in the metadata.
  • Figure 5: Example Metadata for Data Synthesis. We present the metadata given to GPT-4o to synthesize training data for the entity proposal model. It includes information about the original image, the medical entities labeled in the dataset, and related details such as the position of each mask.
  • ...and 19 more figures
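
The Figure 2 caption lists zoom-in as one form of visual evidence derived from a segmentation mask. As a rough illustration of that idea only, the sketch below crops an image to the bounding box of a binary mask. The helpers `mask_bbox` and `zoom_in`, the `pad` parameter, the list-of-rows image representation, and the fallback to the full image when the mask is empty are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # toy image or binary mask stored as a list of rows


def mask_bbox(mask: Grid, pad: int = 0) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), inclusive, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value]
    height, width = len(mask), len(mask[0])
    # Expand the box by `pad` pixels and clamp it to the image bounds.
    top = max(min(rows) - pad, 0)
    bottom = min(max(rows) + pad, height - 1)
    left = max(min(cols) - pad, 0)
    right = min(max(cols) + pad, width - 1)
    return top, left, bottom, right


def zoom_in(image: Grid, mask: Grid, pad: int = 0) -> Grid:
    """Crop the image to the mask's bounding box (hypothetical zoom-in evidence)."""
    box = mask_bbox(mask, pad)
    if box is None:
        # Assumption: with no ROI, fall back to the unmodified full image.
        return image
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# A 4x4 toy image whose pixel values encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
crop = zoom_in(image, mask)
print(crop)
```

A real system would operate on image arrays rather than nested lists and would likely resize the crop before passing it to the reasoning model; those steps are left out here.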