Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Junrong Chen, Lin Yao

TL;DR

This study introduces GPBench, a competency-based benchmark to evaluate whether large language models can function as general practitioners. It jointly defines a domain-specific competency model and three test sets (MCQ, Clinical Case, AI Patient) with expert annotations, and evaluates ten LLMs across these tasks. Results reveal substantial gaps: even reasoning-optimized and medical-specialist variants struggle with core GP duties such as diagnostic reasoning, treatment planning, and patient history-taking, indicating that current systems require human oversight. GPBench provides a framework to quantify GP-relevant competencies and to guide future improvements toward clinically robust AI-assisted general practice.

Abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not yet ready for deployment in general practice settings without human oversight, and further optimization specifically tailored to the daily responsibilities of GPs is essential.

Paper Structure

This paper contains 28 sections, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: An overview of GPBench. Based on the competency model for LLMs, we collected data from open-source datasets and real outpatient medical records from tertiary A-grade hospitals to create three test sets: MCQ Test Set, Clinical Case Test Set, and AI Patient Test Set. Ground truth and scoring criteria for each case were annotated in detail by experts. Accuracy is used as the evaluation metric for the first test set, while for the other two test sets, experts grade responses based on the annotated scoring criteria.
  • Figure 2: The performance of LLMs on the MCQ Test Set across the primary competency indicators.
  • Figure 3: The performance of LLMs on the Clinical Case Test Set across multiple competency indicators.
  • Figure 4: The performance of LLMs on the AI Patient Test Set for the Medical History Taking (I2-2) indicator.
  • Figure 5: An overview of the competency indicators and their associated importance weights in our evaluation framework.
  • ...and 1 more figure