One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

Yuxing Lu, Yushuhong Lin, Jason Zhang

Abstract

Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case's diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one's expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician's judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.
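The abstract's hybrid router can be sketched in a few lines. The following is an illustrative Python sketch, not the paper's implementation: the consensus threshold (0.75), the route names, and the treatment of all-NEUTRAL panels are assumptions, since the paper's exact decision rules are not specified here. It shows how NEUTRAL abstentions are excluded from the tally and how each diagnosis is dispatched to one of the three resolution paths.

```python
from collections import Counter
from enum import Enum

class Vote(Enum):
    KEEP = "keep"
    REFUSE = "refuse"
    NEUTRAL = "neutral"  # principled abstention outside one's expertise

def route(votes, consensus_threshold=0.75):
    """Dispatch one candidate diagnosis to a resolution path.

    Returns 'apply_consensus', 'attending_fallback', or 'arbitration'.
    The threshold and route names are illustrative assumptions.
    """
    # NEUTRAL votes are abstentions and do not count toward consensus.
    decisive = [v for v in votes if v is not Vote.NEUTRAL]
    if not decisive:
        # Every specialist abstained: defer to the attending's judgment.
        return "attending_fallback"
    top_vote, top_count = Counter(decisive).most_common(1)[0]
    agreement = top_count / len(decisive)
    if agreement >= consensus_threshold:
        return "apply_consensus"      # strong consensus: apply directly
    if agreement > 0.5:
        return "attending_fallback"   # weak consensus: attending's initial call
    return "arbitration"              # conflict: evidence-based arbitration
```

Note that arbitration is triggered by the *split* of decisive votes, not raw vote counts; the attending physician then weighs the quality of the competing rationales, as in Figure 3.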

Paper Structure

This paper contains 46 sections, 8 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the CAMP framework. Given a clinical note and candidate diagnoses, the attending physician first renders an initial diagnostic judgment, then assembles a case-adaptive specialist panel with directed focus areas. Each specialist quotes evidence from the note and casts a three-valued vote. A hybrid router directs each diagnosis to one of three resolution paths: strong consensus decisions are applied directly, weak consensus falls back to the attending physician's initial judgment, and conflicts are escalated for evidence-based arbitration.
  • Figure 2: Alignment between specialists selected by CAMP and patients' actual hospital services. Each row is normalized to sum to 100%. Specialist-specific services show strong concentration on the matching specialist, while general services exhibit broader distributions.
  • Figure 3: Conflict resolution via attending-physician arbitration. The attending physician overrides a 2-to-1 majority Refuse vote by weighing the quality of competing specialist rationales rather than counting votes.
  • Figure 4: Token consumption vs. Macro F1. CAMP achieves the highest F1 while consuming fewer tokens than most other multi-agent baselines.
  • Figure 5: Effect of specialist panel size on diagnostic prediction (evaluated on a 200-case subset). The dashed line marks where performance peaks.