Leveraging Large Language Model as Simulated Patients for Clinical Education

Yanzeng Li, Cheng Zeng, Jialun Zhong, Ruoyu Zhang, Minhao Zhang, Lei Zou

TL;DR

The paper tackles the bottleneck of traditional SP-based clinical training by proposing CureFun, a model-agnostic framework that uses LLMs to simulate patient encounters in an educational setting. It combines a graph-driven, context-adaptive SP chatbot (ERRG) with retrieval-augmented generation over a case graph and an automated, ensemble-based assessment module to standardize dialogue and feedback. Empirical results on eight Chinese SP cases show that CureFun produces more authentic SP dialogue flows than baseline LLM chatbots and yields automated scores that align strongly with human grading (mean correlations of roughly 0.81–0.85, p<0.05). The study also evaluates LLMs as virtual doctors (VDs): top models approach human performance in conversational aspects, but human clinicians still achieve higher diagnostic accuracy, which underscores the need for integrated virtual simulated patient (VSP) and VD training for scalable clinical education.
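The agreement between automated and human grading mentioned above can be illustrated with a small correlation check. This is a minimal sketch, not the paper's evaluation code: the summary does not say which correlation coefficient was used, so both Pearson and Spearman are shown as assumptions, and the function names and score lists are hypothetical.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))


# Hypothetical scores for six student dialogues (NOT data from the paper).
auto_scores = [72.0, 85.5, 60.0, 90.0, 78.5, 66.0]
human_scores = [70.0, 88.0, 65.0, 92.0, 75.0, 68.0]

if __name__ == "__main__":
    print("Pearson: ", round(pearson(auto_scores, human_scores), 3))
    print("Spearman:", round(spearman(auto_scores, human_scores), 3))
```

A per-case coefficient computed this way could then be averaged across cases to give a mean correlation of the kind reported; a significance test would need an additional step that is not shown here.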

Abstract

Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limits students' access to this type of clinical training. Consequently, computer program-based simulated patients have emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing Virtual Simulated Patients (VSPs). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, indicating its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.

Paper Structure (11 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The overview diagram of this study. CureFun integrates LLMs to simulate patient roles, enhancing the dialogue flow via structured graph memory, and providing automatic assessment for student-patient conversations.
  • Figure 2: Violin plot of the token length distribution of responses from different underlying LLMs when acting as SP and answering inquiries from doctors. The statistics on the left side of each violin represent the vanilla models' generation, while the right sides represent the responses generated with our proposed framework.
  • Figure 3: (a) Heatmap of pairwise comparison for LLMs acting as SPs. The left-side labels represent the LLM used as "player", while the bottom labels represent the "opponents". "+C" denotes that the corresponding model is collaborating with our framework. (b) The B-ELO score distribution with or without our framework, P < 0.05, one-sided Wilcoxon rank-sum test.
  • Figure 4: The distribution of scores from program evaluator and human evaluator.
  • Figure 5: Preview of an SP case. (a) Original case script/template. (b) Extracted case graph.
  • ...and 2 more figures