
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei, Hao Peng, Yue Guo

Abstract

Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
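
For intuition, the following is a minimal sketch of the Counterfactual Probability Gap described above, under the assumption that it compares a model's diagnostic confidence before and after a targeted counterfactual edit to a single finding (as suggested by the $P_{base}$, $P_{CE}$, and $\Delta P$ notation in Figure 4). The callables `confidence_fn` and `edit_fn`, and the toy scorers in the usage example, are hypothetical stand-ins for the paper's LLM prompts, not its actual implementation.

```python
from typing import Callable


def counterfactual_probability_gap(
    case_text: str,
    finding: str,
    diagnosis: str,
    confidence_fn: Callable[[str, str], float],  # (case, diagnosis) -> confidence in [0, 1]
    edit_fn: Callable[[str, str], str],          # (case, finding) -> counterfactually edited case
) -> float:
    """Confidence shift for `diagnosis` when `finding` is counterfactually edited in the case."""
    p_base = confidence_fn(case_text, diagnosis)   # P_base: confidence on the original case
    edited_case = edit_fn(case_text, finding)      # counterfactual case edit (e.g., remove or alter the finding)
    p_cf = confidence_fn(edited_case, diagnosis)   # P_CE: confidence on the edited case
    return p_base - p_cf                           # large positive gap -> the finding supports this diagnosis


# Toy usage with stand-in scorers; in practice both callables would wrap LLM calls.
if __name__ == "__main__":
    def toy_confidence(case: str, dx: str) -> float:
        return 0.8 if ("fever" in case and dx == "influenza") else 0.4

    def toy_edit(case: str, finding: str) -> str:
        return case.replace(finding, "")

    gap = counterfactual_probability_gap(
        "cough, fever, myalgia", "fever", "influenza", toy_confidence, toy_edit
    )
    print(f"CPG = {gap:.2f}")  # 0.40 here: removing 'fever' halves the toy confidence
```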

Paper Structure

This paper contains 53 sections, 6 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Overview of the proposed counterfactual case editing-based multi-agent diagnostic framework. Given a clinical case, a triage agent selects a set of relevant medical specialists, and an initial differential diagnosis (DDx) is generated to guide subsequent reasoning. During multi-round discussion, each specialist performs counterfactual case editing by modifying targeted clinical findings to test competing diagnostic hypotheses. The impact of these edits is quantified using the Counterfactual Probability Gap (CPG), which measures changes in diagnostic confidence before and after the target evidence operation. These counterfactual signals guide iterative discussion, enabling specialists to explicitly test how individual clinical findings support or weaken competing diagnoses. A final judge agent synthesizes the discussion into an evidence-grounded diagnostic decision and structured reasoning trace.
  • Figure 2: Average diagnostic accuracy of seven LLMs on three datasets, including (a) Llama-3.1-8B-Instruct, (b) Qwen3-8B, (c) m1-7b-23k, (d) MedReason-8B, (e) medgemma-1.5-4b-it, (f) Deepseek-R1, and (g) GPT-5-mini. Bar graphs indicate the accuracy ± 95% CIs. Numerical results and statistical significance tests are provided in Table \ref{tab:main-results-std}.
  • Figure 3: Average diagnostic accuracy of Llama-3.1-8B-Instruct for four diseases/specialties on three datasets, including (a) Disease-level accuracy on MIMIC, (b) Specialty-level accuracy on MedCaseReasoning, and (c) Specialty-level accuracy on ER-Reason. Following Liu et al. [liu2025generalist], we categorize the test cases into specialties and select the top-4 specialties with the most relevant diagnoses. Bar graphs indicate the accuracy ± 95% CIs.
  • Figure 4: Multi-round discussion statistics and the impact of counterfactual case editing on diagnostic confidence using Llama-3.1-8B-Instruct in the multi-round discussion stage. (a) Consensus rate achieved by the multi-round discussion across datasets. (b) Average number of discussion rounds required per case. (c) Specialist diagnosis-change rate across the three datasets. Error bars indicate the standard deviation across three random seeds. (d) Outcomes of diagnosis transformations, categorized by the correctness of the initial and final diagnoses relative to the gold standard (W: wrong, C: correct). (e) Probability density of the diagnostic hypothesis before ($P_{base}$) and after ($P_{CE}$) targeted evidence perturbation during counterfactual case editing. Only cases where the predicted diagnosis remains unchanged before and after CF editing are included. (f) Distribution of diagnosis probability shifts ($\Delta P$) across different target evidence operations on the three datasets. TE: target evidence. All statistics are averaged over three random seeds.
  • Figure 5: Ablation study of Llama-3.1-8B-Instruct over different functional modules on MedCaseReasoning. The shaded area represents the 95% CI. (a) Diagnostic performance with various functional modules added to our multi-agent diagnostic system. w/o: without; CF: counterfactual. (b) Diagnostic performance with various hyperparameters. DDx: differential diagnosis.
  • ...and 11 more figures
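
As a rough illustration of the workflow described in the Figure 1 caption (triage, initial DDx, multi-round specialist discussion with counterfactual checks, final judge), here is a hedged orchestration sketch. Every callable (`triage`, `propose_ddx`, `specialist_turn`, `judge`), the consensus rule, and the round limit are assumptions made for illustration; the paper's actual prompts and stopping criteria may differ.

```python
from typing import Callable, List


def run_diagnostic_discussion(
    case_text: str,
    triage: Callable[[str], List[str]],                     # case -> relevant specialists
    propose_ddx: Callable[[str], List[str]],                # case -> initial differential diagnosis (DDx)
    specialist_turn: Callable[[str, str, List[str]], str],  # (specialist, case, current DDx) -> favored diagnosis
    judge: Callable[[str, List[str]], str],                 # (case, specialist opinions) -> final diagnosis
    max_rounds: int = 3,
) -> str:
    specialists = triage(case_text)
    ddx = propose_ddx(case_text)
    opinions: List[str] = []
    for _ in range(max_rounds):
        # Each specialist re-examines the case (e.g., via counterfactual edits scored with the CPG)
        # and states the hypothesis they currently favor.
        opinions = [specialist_turn(s, case_text, ddx) for s in specialists]
        if len(set(opinions)) == 1:          # consensus -> stop early
            return opinions[0]
        ddx = list(dict.fromkeys(opinions))  # carry the remaining disagreement into the next round
    return judge(case_text, opinions)        # no consensus: the judge agent synthesizes a decision
```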