A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Farzad Ahmed; Joniel Augustine Jerome; Meliha Yetisgen; Özlem Uzuner

TL;DR

Medical documentation errors threaten patient safety, motivating automated detection and correction. The study introduces RAG-enabled Dynamic Prompting (RDP) and systematically compares zero-shot prompting, static prompting with random exemplars (SPR), and RDP across nine instruction-tuned LLMs on the MEDEC/MEDIQA-CORR tasks: error flag detection, error sentence detection, and error correction. Across models, RDP improves recall for error sentence detection, reduces false positives in error flag detection, and enhances correction quality, with statistically significant gains over SPR and zero-shot prompting. Controlled oracle analyses and a qualitative breakdown show that RDP brings predictions closer to clinician reasoning and reduces common error modes, supporting a hybrid workflow in which LLMs handle broad screening and clinicians perform nuanced corrections. Overall, RDP offers a scalable, clinically grounded pathway to improving medical documentation quality, while highlighting remaining gaps in cross-sentence reasoning and rare entities that call for further research.

Abstract

Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction.

Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning.

Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections.

Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
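As a rough illustration of the detection metrics named in the Methods, the sketch below computes accuracy, recall, and false-positive rate from binary error flags using their standard definitions. The function names and the toy labels are hypothetical and are not taken from the paper; the aggregate correction score (ROUGE-1, BLEURT, BERTScore) is not covered here, since the abstract does not state how the three are combined.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Share of texts whose predicted flag matches the gold flag."""
    if not gold:
        return 0.0
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def recall(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Share of texts that truly contain an error (gold == 1) that were flagged."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Share of error-free texts (gold == 0) that were wrongly flagged."""
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels (hypothetical): 1 = text contains an error, 0 = text is correct.
gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 0, 0, 1]

print("accuracy:", accuracy(gold, pred))
print("recall:", recall(gold, pred))
print("FPR:", false_positive_rate(gold, pred))
```

Reporting recall and FPR side by side makes the trade-off described in the Results visible: a prompting strategy that flags more texts can raise recall while also raising FPR.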
