Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Zheqing Li; Yiying Yang; Jiping Lang; Wenhao Jiang; Yuhang Zhao; Shuang Li; Dingqian Wang; Zhu Lin; Xuanna Li; Yuze Tang; Jiexian Qiu; Xiaolin Lu; Hongji Yu; Shuang Chen; Yuhua Bi; Xiaofei Zeng; Yixian Chen; Junrong Chen; Lin Yao

TL;DR

This study introduces GPBench, a competency-based benchmark to evaluate whether large language models can function as general practitioners. It jointly defines a domain-specific competency model and three test sets (MCQ, Clinical Case, AI Patient) with expert annotations, and evaluates ten LLMs across these tasks. Results reveal substantial gaps: even reasoning-optimized and medical-specialist variants struggle with core GP duties such as diagnostic reasoning, treatment planning, and patient history-taking, indicating that current systems require human oversight. GPBench provides a framework to quantify GP-relevant competencies and to guide future improvements toward clinically robust AI-assisted general practice.

Abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not yet ready for deployment in such settings without human oversight, and further optimization specifically tailored to the daily responsibilities of GPs is essential.
