Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset
Hengguan Huang, Songtao Wang, Hongfu Liu, Hao Wang, Ye Wang
TL;DR
This work addresses the lack of scalable, real-time feedback for communicative medical coaching by introducing ChatCoach, a two-agent framework (patient and coach) guided by Generalized Chain-of-Thought (GCoT). It also presents the ChatCoach dataset, built via a multi-agent data-generation pipeline conditioned on external medical resources, to benchmark LLMs on detection and correction of medical terminology misuse. Empirical results show that GCoT improves the structure and external-knowledge integration of coach feedback, outperforming several prompting baselines and approaching human-like guidance in some metrics, while highlighting remaining gaps relative to expert feedback. The work advances medical education with AI by providing a concrete evaluation platform and actionable prompting strategy that enables real-time coaching in clinical conversations, with potential to enhance clinician training and communication quality at scale.
Abstract
Traditional applications of natural language processing (NLP) in healthcare have predominantly focused on patient-centered services, enhancing patient interactions and care delivery, such as through medical dialogue systems. However, the potential of NLP to benefit inexperienced doctors, particularly in areas such as communicative medical coaching, remains largely unexplored. We introduce "ChatCoach", a human-AI cooperative framework designed to assist medical learners in practicing their communication skills during patient consultations. ChatCoach differentiates itself from conventional dialogue systems by offering a simulated environment where medical learners can practice dialogues with a patient agent, while a coach agent provides immediate, structured feedback. This is facilitated by our proposed Generalized Chain-of-Thought (GCoT) approach, which fosters the generation of structured feedback and enhances the utilization of external knowledge sources. Additionally, we have developed a dataset specifically for evaluating Large Language Models (LLMs) within the ChatCoach framework on communicative medical coaching tasks. Our empirical results validate the effectiveness of ChatCoach. Our data and code are available online: https://github.com/zerowst/Chatcoach
