Multi-LLM Collaboration for Medication Recommendation

Huascar Sanchez, Briland Hitaj, Jules Bergmann, Linda Briesemeister

TL;DR

The paper tackles unreliable LLM-driven medication recommendations from unstructured clinical notes by introducing a Chemistry-inspired multi-LLM collaboration framework. It models interaction dynamics through a two-stage generation-evaluation process to achieve efficient, stable, and calibrated ensembles. Evaluation on synthetic clinical vignettes shows that Chemistry-guided ensembles provide competitive accuracy with substantially improved efficiency and robust calibration compared to baselines. The work demonstrates feasibility and sets directions for applying interaction-aware ensembles in real-world clinical decision support, including richer data and retrieval-augmented grounding.

Abstract

As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability of medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.

Paper Structure

This paper contains 16 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Efficiency comparison across sampling strategies. The CHEMISTRY-based multi-LLM ensemble, composed of Claude models, achieved an average generation time of 11 seconds, making it almost 9x faster than the next-fastest strategy (RANDOM) and nearly 49x faster than LOCAL-only ensembles when recommending medical prescriptions from brief clinical vignettes.
  • Figure 2: Effectiveness comparison across sampling strategies. The CHEMISTRY ensemble achieved an accuracy of 0.78, within 0.06 of the REMOTE strategy (0.84), while outperforming the remaining strategies. The ensemble, composed of Claude models from Anthropic, balances accuracy and efficiency in medical prescription recommendations.
  • Figure 3: Stability comparison across sampling strategies. The CHEMISTRY ensemble maintained stability comparable to REMOTE ensembles while surpassing LOCAL and RANDOM strategies. Unlike the REMOTE strategy, which exhibited occasional execution failures, the CHEMISTRY strategy showed no failures, providing evidence of its robustness and reliability.
  • Figure 4: Calibration comparison across sampling strategies. The CHEMISTRY ensemble achieved the lowest variance ($0.05$), indicating high inter-model agreement and effective calibration. By contrast, REMOTE and RANDOM ensembles exhibited moderate variance ($0.11$), while LOCAL ensembles showed poor calibration ($1.05$), reflecting weak consensus among constituent LLMs.
  • Figure 5: Sample outputs produced by our CHEMISTRY-based Multi-LLM Recommendation approach addressing the following task: "Recommend Necessary Medical Prescriptions"
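
The efficiency and calibration figures quoted in the captions above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not the authors' evaluation code: the helper names are hypothetical, the speedup factors are the rounded values from the Figure 1 caption (so the implied times are approximate upper bounds), and treating "variance" as the population variance of per-model scores is an assumption, since this excerpt does not define the calibration metric.

```python
from statistics import pvariance

# Values quoted in the Figure 1 caption.
CHEMISTRY_SECONDS = 11.0
SPEEDUP_VS_RANDOM = 9.0   # caption says "almost 9x", so this is an upper bound
SPEEDUP_VS_LOCAL = 49.0   # caption says "nearly 49x", so this is an upper bound


def implied_seconds(fast_seconds: float, speedup: float) -> float:
    """Generation time of a slower strategy implied by a speedup factor."""
    return fast_seconds * speedup


def score_variance(scores: list[float]) -> float:
    """Population variance of per-model scores (assumed agreement measure).

    Lower variance means the constituent models agree more closely.
    """
    return pvariance(scores)


if __name__ == "__main__":
    random_s = implied_seconds(CHEMISTRY_SECONDS, SPEEDUP_VS_RANDOM)
    local_s = implied_seconds(CHEMISTRY_SECONDS, SPEEDUP_VS_LOCAL)
    print(f"RANDOM: at most about {random_s:.0f} s per generation")
    print(f"LOCAL:  at most about {local_s:.0f} s per generation")

    # Hypothetical per-model scores, for illustration only.
    print(score_variance([1.0, 2.0, 3.0]))
```

Under these assumptions, an 11-second CHEMISTRY run corresponds to roughly a minute and a half for RANDOM and about nine minutes for LOCAL-only ensembles.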