
Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

John Ray B. Martinez

Abstract

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
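The S-Score Weighted Fusion step described above can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: the candidate score (sum of supporting S-scores, i.e. vote count times mean S-score) and the confidence formula (vote fraction times mean supporting S-score) are assumptions for the sketch.

```python
from collections import defaultdict

def fuse(predictions):
    """S-score weighted fusion sketch.

    predictions: list of (answer, s_score) pairs, one per specialist agent.
    Returns the selected answer and an illustrative calibrated confidence.
    """
    groups = defaultdict(list)
    for answer, s_score in predictions:
        groups[answer].append(s_score)

    # Score each candidate by vote count * mean S-score (equivalently, the
    # sum of its supporting S-scores) and pick the highest-scoring answer.
    best = max(groups, key=lambda a: sum(groups[a]))

    # Illustrative confidence: vote fraction scaled by the mean S-score of
    # the supporting specialists (assumed form, not the paper's equation).
    vote_fraction = len(groups[best]) / len(predictions)
    mean_s = sum(groups[best]) / len(groups[best])
    return best, vote_fraction * mean_s
```

For example, with specialists voting A (S=0.9), A (S=0.7), B (S=0.8), C (S=0.5), answer A wins with a supporting-score sum of 1.6, and the reported confidence is 0.5 × 0.8 = 0.4.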


Paper Structure

This paper contains 29 sections, 4 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: The MARC framework pipeline. A medical question is distributed to four domain-specific specialist agents (Component 1), each producing an answer and reasoning chain. Two-Phase Consistency Verification (Component 2; wu2024uncertainty) extracts factual claims, answers them independently and with the original reasoning in context, and measures cross-condition inconsistency to derive an S-score $S_k$ per specialist. S-Score Weighted Fusion (Component 3) selects the final answer by vote-count-weighted mean S-score and derives calibrated confidence $\hat{C}$ from the vote fraction and S-score distribution.
  • Figure 2: Accuracy (top row), ECE (middle row), and AUROC (bottom row) for each of the four datasets and all four configurations. Green-outlined bars indicate the best value in each panel. Config 4 (Full System, dark blue) achieves the best ECE across all four datasets, and the best accuracy on the MedQA subsets. On MedMCQA, Config 3 achieves the best accuracy at the cost of calibration, illustrating the knowledge-recall challenge for verification-based methods.
  • Figure 3: Reliability diagrams for all four datasets. Each panel shows calibration curves for all four configurations against the perfect-calibration diagonal. Across all datasets, Config 4 (solid blue squares) lies closest to the diagonal, confirming that Two-Phase Verification is the dominant calibration driver regardless of question type or dataset size.
  • Figure 4: ROC curves for all four datasets. Each panel plots the four configuration curves with the random baseline (diagonal). Config 4 achieves the highest AUROC on MedQA-250 and MedMCQA-250. On MedQA-100, Config 2 (single specialist with verification) achieves the highest AUROC, and on MedMCQA-100, Config 3 (multi-agent without verification) is best, illustrating that the discrimination benefits of multi-agent fusion scale more reliably with dataset size and question type.
  • Figure 5: Calibration analysis of Qwen2.5-7B across all four evaluation sets. Columns represent configurations C1–C4; each dataset occupies two sub-rows. Top sub-row: stacked confidence frequency histogram (blue = correct answer, red = wrong answer). Bottom sub-row: calibration histogram with 5%-wide bins for visual resolution, where bar height represents observed accuracy within that bin; the dashed diagonal represents perfect calibration. Config 4 (Full System) consistently produces the most evenly spread confidence distributions and aligns most closely with the perfect-calibration diagonal across all datasets.
  • ...and 1 more figure
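The ECE values reported throughout (and visualized in the reliability diagrams of Figures 3 and 5) follow the standard expected calibration error: bin predictions by confidence and take the sample-weighted mean of |accuracy − mean confidence| per bin. The sketch below uses equal-width bins; the paper's exact binning convention (bin count, boundary handling) is assumed here, not taken from the text.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean |bin accuracy - bin confidence|.

    conf: predicted confidences in [0, 1]; correct: 1 if the answer was right.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], with the first bin closed at 0.
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            # Weight each bin's calibration gap by its share of the samples.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

For instance, a system that always reports 0.9 confidence but is always correct has ECE = |1.0 − 0.9| = 0.1, the overconfidence-free analogue of the deferral-signal failure the abstract describes.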