Collaboration among Multiple Large Language Models for Medical Question Answering

Kexin Shang, Chia-Hsuan Chang, Christopher C. Yang

TL;DR

This work tackles the variability and potential unreliability of medical question answering by enabling collaboration among multiple LLMs. It introduces the Iterative Collaboration Framework (ICF), combining Zero-shot Chain-of-Thought with Self-Consistency and a Collaboration Loop to share and re-evaluate reasoning across models. Empirical results on USMLE-style questions show improved consensus and per-model accuracy after collaboration, with a nuanced view of how model confidence and self-consistency relate to performance. The approach is lightweight and interpretable, offering practical implications for safer, collaborative medical QA and future research on cross-model reasoning and reliability signals.

Abstract

Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential for medical tasks. However, little effort has been made to elicit a synergistic effect from multiple LLMs' expertise and backgrounds. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to improve the reasoning ability of all LLMs and to reduce their divergence across questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a correspondence between an LLM's confidence and its prediction accuracy.

Paper Structure

This paper contains 25 sections, 5 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: The ICF framework consists of two parts: (a) ZS-CoT-SC and (b) the Collaboration Loop.
  • Figure 2: Prompt templates for Med42 in the ZS-CoT-SC stage and the collaboration loop of ICF. (Upper) In ZS-CoT-SC, two sequential prompts $T_{reasoning}$ and $T_{answer}$ jointly form the basic ZS-CoT template, where the highlighted part is the base LLM's response. Each ZS-CoT template is then applied 10 times to a question via self-consistency. (Lower) The collaboration loop applies $T_{reasoning\_review}$ in every recursive ZS-CoT-SC round, presenting all LLMs' reasoning pathways and asking for a re-decision at the same time.
  • Figure 3: Prompt templates of the summarizer LLM $\psi$. In this experiment, we deploy Mixtral to summarize repetitive reasoning pathways from self-consistency sampling. The highlighted part is an integrated context of the majority vote and the $n=10$ generated reasonings from one LLM participant in ICF.
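
The two stages named in the captions above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Sampler` callable, the function names `zs_cot_sc` and `collaborate`, the `max_rounds` limit, and the stop-on-consensus rule are all assumptions made for illustration, and the summarizer LLM is replaced by a simple stand-in that forwards one reasoning path supporting each model's majority answer. Only the number of samples ($n=10$) and the majority vote come from the figure captions.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

# Hypothetical interface: (question, peer_context) -> (reasoning, answer letter).
# A real Sampler would wrap one LLM call using the ZS-CoT prompt pair.
Sampler = Callable[[str, str], Tuple[str, str]]


def zs_cot_sc(sample: Sampler, question: str, context: str = "", n: int = 10):
    """Sample n reasoning paths and majority-vote their answers.

    Returns (majority answer, vote share, one reasoning supporting the majority).
    """
    runs = [sample(question, context) for _ in range(n)]
    votes = Counter(answer for _, answer in runs)
    majority, count = votes.most_common(1)[0]
    supporting = next(reasoning for reasoning, answer in runs if answer == majority)
    return majority, count / n, supporting


def collaborate(models: Dict[str, Sampler], question: str,
                n: int = 10, max_rounds: int = 3):
    """Repeat ZS-CoT-SC while sharing reasoning, until all models agree.

    max_rounds and the consensus stopping rule are assumptions of this sketch.
    Returns (final answer per model, number of collaboration rounds used).
    """
    context = ""
    answers: Dict[str, str] = {}
    rounds_used = 0
    for round_idx in range(max_rounds + 1):
        rounds_used = round_idx
        results = {name: zs_cot_sc(fn, question, context, n)
                   for name, fn in models.items()}
        answers = {name: res[0] for name, res in results.items()}
        if len(set(answers.values())) == 1:
            break  # consensus reached
        # Stand-in for the summarizer LLM: pass along one supporting reasoning.
        context = "\n".join(f"{name} chose {res[0]}: {res[2]}"
                            for name, res in results.items())
    return answers, rounds_used
```

The vote share returned by `zs_cot_sc` is one simple way to read off how self-consistent a model is on a question; whether it matches the confidence measure used in the paper is not stated in this summary.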