The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams

Yunqi Zhu, Wen Tang, Huayu Yang, Jinghao Niu, Liyang Dou, Yifan Gu, Yuanyuan Wu, Wensheng Zhang, Ying Sun, Xuebing Yang

TL;DR

The study investigates whether large language models (LLMs) can serve as question setters for medical qualification exams by generating open-ended questions and answers from real-world EHR data. Using the multicenter China Elderly Comorbidity Medical Database (CECMed) and a few-shot prompting framework, eight LLMs are benchmarked against human experts across criteria such as coherence, sufficiency, factual correctness, and professionalism. Results show that certain LLMs (e.g., ERNIE 4, Spark 4, Doubao) approach clinician performance on some question-creation metrics, while human experts still outperform in aspects like sufficiency and correctness for answers; AI revisions guided by expert feedback can improve output quality, though answers generally lag behind humans. The work highlights a feasible AI-assisted path for scalable medical exam content generation, while underscoring limitations like hallucinations, language bias, and reproducibility, and points to future directions such as retrieval-augmented generation and broader, multilingual evaluations to enhance reliability and applicability in medical education.

Abstract

In this work, we use LLMs to produce medical qualification exam questions and the corresponding answers through few-shot prompts, and investigate in depth how well LLMs meet requirements for coherence, evidence of statement, factual consistency, and professionalism. Using a multicenter, bidirectional, anonymized database of comorbid chronic diseases, the Elderly Comorbidity Medical Database (CECMed), we tasked LLMs with generating open-ended questions and answers based on a sampled subset of admission reports. In CECMed, the retrospective cohort includes patients enrolled from January 2010 to January 2022 and the prospective cohort includes patients enrolled from January 2023 to November 2023, with participants drawn from selected tertiary and community hospitals across the southern, northern, and central regions of China. A total of 8 widely used LLMs were evaluated, including ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, and Qwen. Conventional medical education requires experienced clinicians to formulate questions and answers based on prototypes from EHRs, which is heuristic and time-consuming. We found that mainstream LLMs could generate questions and answers from real-world EHRs at levels close to those of clinicians. Although current LLMs performed unsatisfactorily in some aspects, medical students, interns, and residents could reasonably use LLMs to facilitate understanding.

Paper Structure

This paper contains 7 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Evolution of the "AI + Medicine" paradigm.
  • Figure 2: Overall pipeline. A. Based on a real-world Chinese elderly chronic disease database, an AI model was employed to generate patient reports. B. Medical experts wrote free-form reference Q&A for a subset of the reports. C. The reference report-question pairs were used as prior information, combined with prompt templates, enabling the AI model to generate questions for the remaining reports. D. The reference Q&A pairs were employed as prior information, integrated with prompt templates, leveraging the AI model to generate answers for the questions produced in phase C. E. Human experts reviewed and revised a small number of Q&A pairs. F. The Q&A-review pairs were used as prior information, integrated with prompt templates, leveraging the AI model to decide whether the remaining Q&A pairs require revision, and to provide revised answers if necessary. G. Based on multiple evaluation aspects, an independent group of human experts assessed the AI-generated Q&A through scoring.
  • Figure 3: Human evaluation on question generation. Each LLM generated questions based on the same sampled set of admission reports, and each human expert scored all criteria on an integer scale from 1 to 5, with higher scores denoting better performance. Error bars depict the standard deviation of the mean scores.
  • Figure 4: Human evaluation on answer generation. Each LLM generated answers based on the same sampled set of AI-generated questions, and each human expert scored all criteria on an integer scale from 1 to 5, with higher scores denoting better performance. Error bars depict the standard deviation of the mean scores.
  • Figure 5: Human evaluation on answer revision. In the histogram, bars for the original answers are unfilled, while bars for AI-revised answers are filled with diagonal stripes. Each LLM generated answers based on the same sampled set of AI-generated questions, and each human expert scored all criteria on an integer scale from 1 to 5, with higher scores denoting better performance. Error bars depict the standard deviation of the mean scores.
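
Phase C of the pipeline in Figure 2 (reference report-question pairs used as prior information together with a prompt template) can be illustrated with a minimal sketch of few-shot prompt assembly. Everything below is an assumption made for illustration: the `Example` class, the `INSTRUCTION` wording, the `build_few_shot_prompt` helper, and the placeholder reports are hypothetical and are not the paper's actual template or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    report: str    # admission report text (placeholder, not real data)
    question: str  # expert-written reference question (placeholder)


# Hypothetical instruction; the paper's actual prompt wording is not shown here.
INSTRUCTION = (
    "You are setting questions for a medical qualification exam. "
    "Given an admission report, write one open-ended question."
)


def build_few_shot_prompt(examples: List[Example], target_report: str) -> str:
    """Join the instruction, the reference pairs, and the new report."""
    parts = [INSTRUCTION]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nReport: {ex.report}\nQuestion: {ex.question}"
        )
    # The final question is left empty so the model completes it.
    parts.append(f"Report: {target_report}\nQuestion:")
    return "\n\n".join(parts)


# Made-up placeholder pairs standing in for expert-written references.
examples = [
    Example("Placeholder report A", "Placeholder question A"),
    Example("Placeholder report B", "Placeholder question B"),
]
prompt = build_few_shot_prompt(examples, "Placeholder report C")
print(prompt)
```

The same pattern would extend to phases D and F by swapping the reference pairs for Q&A pairs or Q&A-review pairs; the resulting string would then be sent to whichever LLM is being evaluated.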