Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review
Yan Yang, Mouxiao Bian, Peiling Li, Bingjian Wen, Ruiyao Chen, Kangkun Mao, Xiaojun Ye, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Kaifeng Qiu, Junyan Wu
TL;DR
RxBench is a pharmacist-verified, error-type-oriented benchmark for systematically evaluating large language models on prescription review. The study benchmarks 18 frontier LLMs against a standardized 100-item pharmacist test and analyzes performance across single-choice, multiple-choice, and short-answer tasks, identifying clear task-dependent strengths and weaknesses. Domain-specific LoRA fine-tuning of a mid-tier model yields performance rivaling leading general-purpose LLMs on short-answer tasks, underscoring the value of targeted data in safety-critical domains. Overall, the results show that state-of-the-art LLMs can match or exceed human pharmacists on certain prescription-review tasks, indicating significant potential for AI-assisted clinical decision support while emphasizing the need for careful integration and real-world validation.
Abstract
The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-grained evaluation, we developed RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references. RxBench consists of 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. We benchmarked 18 state-of-the-art LLMs and observed a clear stratification of performance across tasks. Notably, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 consistently formed the first tier, outperforming other models in both accuracy and robustness. Comparisons with licensed pharmacists indicated that leading LLMs can match or exceed human performance in certain tasks. Furthermore, building on insights from the benchmark evaluation, we performed targeted fine-tuning on a mid-tier model, producing a specialized model that rivals leading general-purpose LLMs on short-answer tasks. The main contribution of RxBench is a standardized, error-type-oriented framework that reveals the capabilities and limitations of frontier LLMs in prescription review and provides a foundational resource for building more reliable and specialized clinical tools.
