Collaboration among Multiple Large Language Models for Medical Question Answering
Kexin Shang, Chia-Hsuan Chang, Christopher C. Yang
TL;DR
This work tackles the variability and potential unreliability of medical question answering by enabling collaboration among multiple LLMs. It introduces the Iterative Collaboration Framework (ICF), combining Zero-shot Chain-of-Thought with Self-Consistency and a Collaboration Loop to share and re-evaluate reasoning across models. Empirical results on USMLE-style questions show improved consensus and per-model accuracy after collaboration, with a nuanced view of how model confidence and self-consistency relate to performance. The approach is lightweight and interpretable, offering practical implications for safer, collaborative medical QA and future research on cross-model reasoning and reliability signals.
Abstract
Empowered by vast internal knowledge reservoirs, the new generation of large language models (LLMs) demonstrates untapped potential for medical tasks. However, little effort has been made to elicit a synergistic effect from the expertise and backgrounds of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of three pre-trained LLM participants, our framework is shown to improve the reasoning ability of all LLMs and to reduce their divergence across questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a concurrence between an LLM's confidence and its prediction accuracy.
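
The collaboration idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the model interface, the function names `self_consistent_answer` and `collaborate`, the majority-vote rule, and the stopping condition (stop at consensus or after `max_rounds`) are all assumptions made for illustration. Each model is treated as a hypothetical callable that returns several sampled answer letters for a question, optionally given the current answers of its peers.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical interface (an assumption, not from the paper):
# a model takes the question and its peers' current answers,
# and returns a list of sampled answer letters.
Model = Callable[[str, Dict[str, str]], List[str]]


def self_consistent_answer(samples: List[str]) -> Tuple[str, float]:
    """Majority vote over sampled answers.

    Returns the most frequent answer and its share of the votes,
    which can serve as a rough self-consistency signal.
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def collaborate(
    question: str, models: Dict[str, Model], max_rounds: int = 3
) -> Tuple[Dict[str, str], int]:
    """Re-query models with peer answers until they agree or rounds run out."""
    # Initial round: every model answers independently.
    answers = {
        name: self_consistent_answer(model(question, {}))[0]
        for name, model in models.items()
    }
    rounds = 0
    while len(set(answers.values())) > 1 and rounds < max_rounds:
        # Freeze the previous round so all models see the same peer answers.
        snapshot = dict(answers)
        for name, model in models.items():
            peers = {k: v for k, v in snapshot.items() if k != name}
            answers[name] = self_consistent_answer(model(question, peers))[0]
        rounds += 1
    return answers, rounds
```

In this sketch the vote share returned by `self_consistent_answer` is only a stand-in for a reliability signal; how confidence is actually measured in the study is not specified in this section.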
