MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills

Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu

TL;DR

MedQA-CS is an OSCE-inspired AI-SCE benchmark that systematically evaluates LLM clinical skills through two roles: MedStuLLM (LLM-as-student) and MedExamLLM (LLM-as-examiner). The dataset is built from 44 USMLE Step 2 CS cases that were paraphrased and expert-validated; it contains 1,667 instruction-based data points and is released for public, non-commercial use (CC BY-NC 4.0). Across its four sections (InfoGatherQA, Physical Exams, Closure, and Differential Diagnosis), MedQA-CS proves more challenging than multiple-choice question (MCQ) benchmarks. The GPT-4–based MedExamLLM shows high agreement with human experts, which suggests that LLMs can serve as reliable automated evaluators in clinical-skill contexts. Combined with existing benchmarks, MedQA-CS enables a more comprehensive, cross-model assessment of clinical skills for both open- and closed-source LLMs, with implications for safer AI-assisted clinical workflows.

Abstract

Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education's Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing a quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of LLMs' clinical capabilities for both open- and closed-source LLMs.

Paper Structure (68 sections, 3 equations, 2 figures, 45 tables)

Figures (2)

  • Figure 1: Miller's pyramid of clinical competence matched with an appropriate level of assessment. Figure adapted from Miller (1990).
  • Figure 2: Overview of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS). The medical student begins by reviewing the doorway information (Phase ①), then gathers the patient's history ②, performs a physical examination ③, concludes with the closure phase ④, and documents the encounter in a patient note with a differential diagnosis ⑤. Throughout these phases, the Clinical Skills Examiner plays the role of the patient, interacting with the Medical Student to simulate a real clinical encounter and assess their clinical skills. The examiner provides feedback and scores the student's performance based on predefined criteria. This OSCE-structured approach ensures a comprehensive assessment of the student's ability to conduct patient encounters effectively and professionally. Our main objective is to transform this OSCE into an AI-SCE for benchmarking LLM clinical skills. Accordingly, each phase defines tasks for both MedStuLLM (LLM-as-student) and MedExamLLM (LLM-as-examiner). The goal for MedStuLLM is to achieve a high AI-SCE score, demonstrating its clinical skills, while the goal for MedExamLLM is to correlate strongly with the expert examiner's scores, demonstrating its capability as a judge in the clinical domain. More details about USMLE Step 2 CS can be found in appendix \ref{apx:sec:overview_usmle}, and one example in appendix \ref{apx:sec:example_usmle}.
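
The MedExamLLM goal described in Figure 2, a high correlation with the expert examiner's scores, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the choice of Pearson correlation is an assumption (the text here does not name a specific statistic), and the function name `pearson` and both score lists are hypothetical values invented for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mx, my = mean(xs), mean(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-case scores (NOT from the paper): one entry per
# MedStuLLM response, scored by a human expert and by MedExamLLM.
expert_scores = [80, 65, 90, 40, 75]
exam_llm_scores = [78, 70, 88, 45, 70]

r = pearson(expert_scores, exam_llm_scores)
print(f"Pearson r between expert and MedExamLLM scores: {r:.3f}")
```

A value of r close to 1 would indicate that the examiner LLM ranks and scales student responses much as the expert does. A rank-based statistic could be substituted if only the ordering of scores matters.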