Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

Grace Chang Yuan; Xiaoman Zhang; Sung Eun Kim; Pranav Rajpurkar

TL;DR

Mixed-vendor multi-agent teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. Overlap analysis identifies this as the underlying mechanism and points to vendor diversity as a key design principle for robust clinical diagnostic systems.

Abstract

Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.

Paper Structure (51 sections, 6 figures, 12 tables)

This paper contains 51 sections, 6 figures, 12 tables.

Introduction
Related Work
Clinical LLMs and Vendor Differences
Multi-agent LLM Frameworks in Clinical Diagnosis
Mixed-vendor Multi-agent Systems
Methods
MAC Framework
Model Selection
Conversation Protocol and Stopping Rule
Experimental Setup
Datasets
RareBench
DiagnosisArena
Tasks and Metrics
RareBench
...and 36 more sections

Figures (6)

Figure 1: Overview of the 3 different model configurations (Single-LLM, Single-vendor MAC, and Mixed-vendor MAC), and an illustrative example of a case narrative with its corresponding diagnosis list.
Figure 2: Analysis of Correct Prediction Overlap. The x-axis shows the dataset. For the RareBench datasets (MME, HMS, LIRICAL), overlap is calculated using Recall@10. For DiagnosisArena, overlap is calculated using Top-5 Accuracy. In each pair, the left bar compares Mixed-Vendor MAC against the Best Single LLM, and the right bar compares it against the Best Single-Vendor MAC.
Figure 3: $\Delta\text{Coverage}$ across different datasets between Mixed-vendor MAC and Single LLMs. High positive bars indicate the Mixed model subsumes the baseline's knowledge.
Figure 4: Pairwise Jaccard similarity heatmaps between Single LLMs and the Mixed-Vendor MAC.
Figure 5: $\Delta\text{Coverage}$ across different datasets between Mixed-vendor MAC and Single-vendor MAC.
...and 1 more figure
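
The overlap analysis described in the captions for Figures 2 and 4 compares the sets of cases each system diagnoses correctly. The following is a minimal sketch of that kind of comparison, assuming each system's correct predictions are available as a set of case IDs. The function names, system labels, and case IDs are hypothetical illustrations, not the authors' code or data; the only formula used is the standard Jaccard similarity, |A ∩ B| / |A ∪ B|.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Standard Jaccard similarity |A & B| / |A | B|.

    Returns 1.0 when both sets are empty (a convention chosen for this sketch).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def pairwise_jaccard(correct: dict) -> dict:
    """Jaccard similarity for every unordered pair of systems.

    `correct` maps a system name to the set of case IDs it got right.
    Keys of the result are (name_1, name_2) tuples in sorted order.
    """
    return {
        (m, n): jaccard(correct[m], correct[n])
        for m, n in combinations(sorted(correct), 2)
    }


# Hypothetical data: case IDs each system diagnosed correctly.
correct_cases = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "mixed_team": {1, 2, 3, 4, 5, 6},
}

scores = pairwise_jaccard(correct_cases)

# Cases the mixed team solved that no individual model solved.
only_mixed = correct_cases["mixed_team"] - (
    correct_cases["model_a"] | correct_cases["model_b"]
)

for pair, value in scores.items():
    print(pair, round(value, 3))
print("solved only by mixed_team:", sorted(only_mixed))
```

A low Jaccard score between two systems means they succeed on largely different cases, which is the situation in which combining them has the most room to help. The `only_mixed` set corresponds to cases that the individual systems collectively miss.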
