
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman

TL;DR

The results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.

Abstract

Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
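
As an illustration of the perplexity-based uncertainty scoring mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes per-token log-probabilities are already available from the generating model; the function names and the review threshold of 10.0 are hypothetical choices made only for this example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def needs_human_review(token_logprobs, threshold=10.0):
    """Flag a response for the human validation path when perplexity is high.

    The threshold is an assumed value for illustration, not one from the paper.
    """
    return perplexity(token_logprobs) > threshold


# Toy inputs: every token assigned probability 0.5 versus 0.05.
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.05)] * 4

print(perplexity(confident), needs_human_review(confident))
print(perplexity(uncertain), needs_human_review(uncertain))
```

Lower perplexity means the model assigned higher probability to its own output, which is why the abstract reports a perplexity value as evidence that evidence augmentation reduces uncertainty.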

Paper Structure

This paper contains 64 sections, 3 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Transformer Architecture Comparison: (Left) GPT decoder-based architecture processes input through learned positional embeddings combined with token embeddings, applies masked multi-head self-attention (preventing attention to future tokens), followed by feedforward layers with GELU activation. Layer normalisation (epsilon 1e-5) ensures training stability. (Right) LLaMA architecture employs RMSNorm for improved stability, Rotary Positional Embeddings (RoPE) for efficient position encoding, Flash Attention for memory optimisation, and SwiGLU activation, optimised for efficiency in large-scale medical language modelling (Vaswani2017Attention; NVIDIA2025TransformerEngine).
  • Figure 2: DeepSeek Architecture Overview: This figure illustrates the DeepSeek model's Mixture-of-Experts (MoE) framework, where a Router Network selectively activates a subset of expert models per token, optimising computational efficiency (GeeksForGeeks2025DeepSeekR1).
  • Figure 3: Data Processing Workflow: Medical question-answer datasets undergo memory-efficient loading in chunks (shards), followed by tokenisation using different methods for each model (GPT uses Byte-Pair Encoding; LLaMA and DeepSeek use SentencePiece). Text sequences are padded to consistent lengths, then organised into training batches. This pipeline ensures efficient processing of large medical datasets whilst maintaining compatibility across all three model architectures.
  • Figure 4: Comparison of tokenisation approaches across architectures
  • Figure 5: Training Dynamics of GPT and LLaMA: The left plot illustrates the learning rate schedule across training steps for GPT (blue) and LLaMA (red), showing a peak followed by a gradual decrease.
  • ...and 9 more figures
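
The shard, pad, and batch steps described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: tokenisation is assumed to have already produced lists of integer IDs, and `PAD_ID`, `pad_to_length`, `iter_chunks`, and `make_batches` are hypothetical names introduced for this example.

```python
from typing import Iterator, List

PAD_ID = 0  # assumed padding token ID; real tokenisers define their own


def pad_to_length(ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Truncate to max_len, then right-pad so every sequence has equal length."""
    truncated = ids[:max_len]
    return truncated + [pad_id] * (max_len - len(truncated))


def iter_chunks(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items (used for shards and batches)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def make_batches(tokenised: List[List[int]], max_len: int,
                 batch_size: int, shard_size: int) -> List[List[List[int]]]:
    """Process one shard at a time: pad its sequences, then group them into batches."""
    batches = []
    for shard in iter_chunks(tokenised, shard_size):
        padded = [pad_to_length(ids, max_len) for ids in shard]
        batches.extend(iter_chunks(padded, batch_size))
    return batches


examples = [[5, 6, 7], [8], [9, 10, 11, 12, 13]]
batches = make_batches(examples, max_len=4, batch_size=2, shard_size=2)
print(batches)
```

Here the in-memory list stands in for a dataset on disk; handling one shard at a time is what keeps peak memory bounded when the shards are instead read lazily from storage.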