Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

Zheheng Luo; Chenhan Yuan; Qianqian Xie; Sophia Ananiadou

TL;DR

Existing healthcare benchmarks largely center on physicians and fail to cover the breadth of healthcare professions. This work introduces EMPEC, a large-scale, time-stamped exam-question dataset in traditional Chinese spanning 20 professions, 124 subjects, and 157,803 questions, enabling cross-professional evaluation of LLMs. Across 17 models, general-purpose LLMs often outperform medical-domain variants, training on EMPEC data yields substantial gains, and results on questions released after the models' training cutoff follow the same trends as overall performance. Converting the questions from traditional to simplified Chinese has little impact on accuracy. EMPEC thus offers a comprehensive benchmark to assess and guide the development of AI systems intended for real-world healthcare knowledge tasks beyond physician-only scenarios.

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated their potential in delivering accurate answers to questions about world knowledge. Despite this, existing benchmarks for evaluating LLMs in healthcare predominantly focus on medical doctors, leaving other critical healthcare professions underrepresented. To fill this research gap, we introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists. Each question is tagged with its release time and source, ensuring relevance and authenticity. We conducted extensive experiments on 17 LLMs, including proprietary and open-source models as well as general-domain and medical-specific models, evaluating their performance under various settings. Our findings reveal that while leading models like GPT-4 achieve over 75% accuracy, they still struggle with specialized fields and alternative medicine. Surprisingly, general-purpose LLMs outperformed medical-specific models, and incorporating EMPEC's training data significantly enhanced performance. Additionally, the results on questions released after the models' training cutoff date were consistent with overall performance trends, suggesting that the models' performance on the test set can predict their effectiveness in addressing unseen healthcare-related queries. Converting the questions from traditional to simplified Chinese characters had a negligible impact on model performance, indicating robust linguistic versatility. Our study underscores the importance of expanding benchmarks to cover a broader range of healthcare professions to better assess the applicability of LLMs in real-world healthcare scenarios.
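
The per-profession and post-cutoff comparisons described in the abstract can be illustrated with a minimal sketch. It assumes each evaluated question is a record with a profession, a release date, a gold answer, and a model prediction; the field names, the sample records, and the cutoff date are all hypothetical and do not reflect EMPEC's actual schema or the authors' evaluation code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: field names and values are illustrative only,
# not EMPEC's actual schema or data.
records = [
    {"profession": "Optometrist", "released": date(2021, 7, 1), "gold": "A", "pred": "A"},
    {"profession": "Optometrist", "released": date(2023, 7, 1), "gold": "C", "pred": "B"},
    {"profession": "Audiologist", "released": date(2023, 7, 1), "gold": "D", "pred": "D"},
    {"profession": "Audiologist", "released": date(2020, 2, 1), "gold": "B", "pred": "B"},
]


def accuracy(rows):
    """Fraction of rows whose prediction matches the gold answer (None if empty)."""
    rows = list(rows)
    if not rows:
        return None
    return sum(r["gold"] == r["pred"] for r in rows) / len(rows)


def accuracy_by_profession(rows):
    """Group rows by profession and compute accuracy within each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["profession"]].append(r)
    return {profession: accuracy(group) for profession, group in groups.items()}


def accuracy_after_cutoff(rows, cutoff):
    """Accuracy restricted to questions released strictly after the cutoff date."""
    return accuracy(r for r in rows if r["released"] > cutoff)


overall = accuracy(records)
per_profession = accuracy_by_profession(records)
# The cutoff below is an assumed placeholder, not any model's real training cutoff.
post_cutoff = accuracy_after_cutoff(records, date(2023, 1, 1))
```

Comparing `post_cutoff` against `overall` for each model is one simple way to check whether accuracy on later-released questions tracks overall accuracy, which is the kind of consistency the abstract reports.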

Paper Structure (28 sections, 4 figures, 7 tables)

Introduction
Existing medical benchmarks are limited in scope and authenticity
Related work
Healthcare Knowledge Benchmark
Knowledge-related Benchmarks for LLM
The EMPEC Dataset
Data Collection and Pre-processing
Dataset Statistics
Dataset Characteristics
Benchmark
Tested Models
General LLMs
Medical Domain LLMs
Evaluation settings
Analysis
...and 13 more sections

Figures (4)

Figure 1: An example question from EMPEC. Text in blue is the English translation of the original Chinese question and answers.
Figure 2: Distribution of professions in the EMPEC dataset. The left panel illustrates the total number of questions attributed to each healthcare profession within the dataset. The right panel provides a visual representation of the proportionate distribution of questions across the various professions.
Figure 3: Performance of models on traditional Chinese and simplified Chinese.
Figure 4: The prompt used in the zero-shot evaluation and supervised fine-tuning. Text in blue is the English translation of the Chinese content.
