MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills
Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu
TL;DR
MedQA-CS is an OSCE-inspired AI-SCE benchmark that systematically evaluates LLM clinical skills in two roles: MedStuLLM (LLM-as-student) and MedExamLLM (LLM-as-examiner). The dataset, built from 44 USMLE Step 2 CS cases that were paraphrased and expert-validated, contains 1,667 instruction-based data points and is released for public, non-commercial use (CC BY-NC 4.0). Across InfoGatherQA, Physical Exams, Closure, and Differential Diagnosis, MedQA-CS is more challenging than multiple-choice question (MCQ) benchmarks, and the GPT-4-based MedExamLLM shows high agreement with human experts, supporting the use of LLMs as reliable automated evaluators of clinical skills. Combined with existing benchmarks, MedQA-CS enables a more comprehensive, cross-model assessment of clinical skills for both open- and closed-source LLMs, with implications for safer AI-assisted clinical workflows.
Abstract
Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these skills comprehensively. To address this gap, we introduce MedQA-CS, an AI-SCE framework inspired by the Objective Structured Clinical Examinations (OSCEs) used in medical education. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, both designed to reflect real clinical scenarios. Our contributions are twofold: we develop MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and we provide a quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of the clinical capabilities of both open- and closed-source LLMs.
