Human-AI collectives produce the most accurate differential diagnoses

N. Zöller, J. Berger, I. Lin, N. Fu, J. Komarneni, G. Barabucci, K. Laskowski, V. Shia, B. Harack, E. A. Chu, V. Trianni, R. H. J. M. Kurvers, S. M. Herzog

TL;DR

Hybrid collectives of physicians and LLMs outperform single physicians, physician collectives, single LLMs, and LLM ensembles in open-ended medical diagnostics.

Abstract

Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased - shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 medical cases. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.

Paper Structure (26 sections, 1 equation, 14 figures)

Figures (14)

  • Figure 1: Illustration of the hybrid collective intelligence process, which combines human diagnoses with LLM outputs to arrive at a collective differential diagnosis. a, Screenshot of the interface that human users see when diagnosing a patient case on the Human Dx platform via a mobile device. The information provided can include a patient's symptoms, test results, and medical record. Users can uncover this information piece by piece and update their diagnosis accordingly. In this analysis, we only consider users' final differential diagnosis. The same information shown to human users is also given to LLMs as part of a prompt (see Methods). b, An illustrative example of the open-ended text responses given by users and LLMs. Next, extending a method presented in kurvers2023 (see Methods and Extended Data Fig. \ref{fig:prompt_engineering}), each single diagnosis is subjected to several processing steps for standardization, after which it is assigned a unique ID in the SNOMED CT healthcare terminology. c, Example of a SNOMED CT entry. Crucially, all listed synonyms are matched to the same SNOMED CT ID. d, Diagnoses of humans and LLMs after the matching step. e, Collective diagnosis after aggregating the diagnoses from humans and LLMs. In this aggregation, LLMs and humans are assigned different weights based on their performance in the training fold. The rank $r$ of a diagnosis in a differential diagnosis is taken into account through a $1/r$ scoring rule (see Methods).
  • Figure 2: Cross-validated performance of five individual LLMs (Anthropic Claude 3 Opus, OpenAI GPT-4, Mistral Large, Google Gemini Pro 1.0 and Meta Llama 2 70B) and ensembles of all possible combinations of LLMs. Panels show performance for four outcome metrics ($y$ axes): Top-$k$ indicates the proportion of cases for which the correct diagnosis was among the $k$ top-ranked diagnoses (for $k \in \{1, 3, 5\}$); MRR shows the mean reciprocal rank of correct diagnoses across cases (see eq. \ref{eq:MRR}). The $x$ axis shows the number of LLMs in an ensemble. The horizontal dashed line shows the average individual performance of the physicians (i.e., first averaged within cases, then across all cases). Some of the ensembles overplot each other (see Table \ref{tbl:LLMs} in the supplement for the performance of all combinations).
  • Figure 3: Cross-validated performance of human-only ensembles and hybrid ensembles of humans and LLMs. Panels show performance for four outcome metrics ($y$ axes): Top-$k$ indicates the proportion of cases for which the correct diagnosis was among the $k$ top-ranked diagnoses (for $k \in \{1, 3, 5\}$); MRR shows the mean reciprocal rank of correct diagnoses across cases (see eq. \ref{eq:MRR}). The individual performance of the five LLMs (and their combined performance in an all-LLMs ensemble) is shown as the left-most square of each color in each panel. The $x$ axis shows the number of humans added to individual LLMs or to an all-LLMs ensemble.
  • Figure 4: Complementarity of solutions from individual humans and human-only ensembles and LLMs. Panels a and b show, for each of the five LLMs, matrices with the percentages of cases for all 36 combinations of the LLM ($x$ axis) and humans ($y$ axis) assigning the correct diagnosis a particular rank (i.e., rank 1, 2, 3, 4, 5 or not ranked). a, Results for individual physicians. b, Results for five-physician human-only ensembles. The highlighted diagonal indicates cases where an LLM and the humans assigned the correct diagnosis the same rank. Panels c and d show the percentage of cases in which the same diagnoses were assigned rank one, comparing individual physicians and five-physician ensembles to LLMs (left side), and different LLMs to each other (right side). c, Overall rank-one agreement, regardless of whether the correct diagnosis was included. d, Rank-one agreement when both diagnosticians were incorrect. Results were extracted from the cross-validation procedure by recording the frequencies with which physicians and LLMs assigned the same or a different rank to either the correct or an incorrect diagnosis, averaged across all cases and the five folds (see Methods). Note that due to rounding to integers, there may be small inconsistencies when summing rows or columns across matrices or when comparing sums of values to respective percentages reported in the main text.
  • Figure S1: Illustration of LLM prompt engineering and validation method. We nested our prompt engineering and Weighted Majority Voting Ensemble (WMVE; Dogan2019) sequence in a five-fold cross-validation procedure. First, we determined which prompt performed best for each LLM in the training fold (one-fifth of the data; see Methods). Second, we calculated weights for each member of the ensemble, also using the training fold. The weights were then used to aggregate collective diagnoses and evaluate the ensemble's performance on the remaining folds (four-fifths of the data). This process yields one result per fold, of which the averages are reported in the main text. We repeated this procedure for every metric reported in the main text (i.e., top-1, top-3, top-5 and MRR).
  • ...and 9 more figures
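
The weighted $1/r$ aggregation described in Figure 1e and the Top-$k$ and reciprocal-rank metrics from Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the placeholder diagnosis IDs, and the member weights are hypothetical (the captions only say that weights come from training-fold performance, not how they are computed), and scoring a missing correct diagnosis as 0 is an assumption.

```python
from collections import defaultdict


def aggregate(differentials, weights):
    """Combine ranked diagnosis lists into one collective ranking.

    differentials: dict of member name -> ranked list of diagnosis IDs
    weights: dict of member name -> non-negative weight (hypothetical values)
    Each diagnosis at rank r contributes weight / r to its total score.
    """
    scores = defaultdict(float)
    for member, ranked in differentials.items():
        for r, dx in enumerate(ranked, start=1):
            scores[dx] += weights[member] / r
    # Highest score first; ties broken by ID so the result is deterministic.
    return sorted(scores, key=lambda dx: (-scores[dx], dx))


def reciprocal_rank(ranked, correct):
    """1 / rank of the correct diagnosis, or 0.0 if absent (assumption)."""
    return 1.0 / (ranked.index(correct) + 1) if correct in ranked else 0.0


def top_k_hit(ranked, correct, k):
    """True if the correct diagnosis is among the k top-ranked entries."""
    return correct in ranked[:k]


# Placeholder IDs standing in for matched SNOMED CT concept IDs.
differentials = {
    "physician_1": ["A", "B", "C"],
    "physician_2": ["B", "A"],
    "llm_1": ["B", "D", "A"],
}
weights = {"physician_1": 1.0, "physician_2": 1.0, "llm_1": 0.5}

collective = aggregate(differentials, weights)
print(collective)
print(reciprocal_rank(collective, "A"), top_k_hit(collective, "A", 3))
```

Averaging `reciprocal_rank` over all cases gives the MRR, and averaging `top_k_hit` gives the Top-$k$ proportion. In this toy case the correct diagnosis "A" is ranked second in the collective list because "B" collects more weighted rank credit across the three members.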