The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams
Yunqi Zhu, Wen Tang, Huayu Yang, Jinghao Niu, Liyang Dou, Yifan Gu, Yuanyuan Wu, Wensheng Zhang, Ying Sun, Xuebing Yang
TL;DR
The study investigates whether large language models (LLMs) can serve as question setters for medical qualification exams by generating open-ended questions and answers from real-world EHR data. Using the multicenter China Elderly Comorbidity Medical Database (CECMed) and a few-shot prompting framework, eight LLMs are benchmarked against human experts across criteria such as coherence, sufficiency, factual correctness, and professionalism. Results show that certain LLMs (e.g., ERNIE 4, Spark 4, Doubao) approach clinician performance on some question-creation metrics, while human experts still outperform in aspects like sufficiency and correctness for answers; AI revisions guided by expert feedback can improve output quality, though answers generally lag behind humans. The work highlights a feasible AI-assisted path for scalable medical exam content generation, while underscoring limitations like hallucinations, language bias, and reproducibility, and points to future directions such as retrieval-augmented generation and broader, multilingual evaluations to enhance reliability and applicability in medical education.
Abstract
In this work, we use LLMs to produce medical qualification exam questions and the corresponding answers through few-shot prompts, and investigate in depth how well the outputs meet requirements for coherence, evidence of statement, factual consistency, and professionalism, among other criteria. Utilizing a multicenter, bidirectional, anonymized database of comorbid chronic diseases, named the China Elderly Comorbidity Medical Database (CECMed), we tasked LLMs with generating open-ended questions and answers based on a subset of sampled admission reports. In CECMed, the retrospective cohort includes patients enrolled from January 2010 to January 2022 and the prospective cohort includes patients enrolled from January 2023 to November 2023, with participants drawn from selected tertiary and community hospitals across the southern, northern, and central regions of China. A total of 8 widely used LLMs were evaluated, including ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, and Qwen. Conventional medical education requires experienced clinicians to formulate questions and answers based on prototypes from EHRs, which is heuristic and time-consuming. We found that mainstream LLMs could generate questions and answers from real-world EHRs at levels close to those of clinicians. Although current LLMs performed unsatisfactorily in some aspects, medical students, interns, and residents could reasonably use LLMs to facilitate understanding.
