Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset

Hengguan Huang; Songtao Wang; Hongfu Liu; Hao Wang; Ye Wang

TL;DR

This work addresses the lack of scalable, real-time feedback for communicative medical coaching by introducing ChatCoach, a two-agent framework (patient and coach) guided by Generalized Chain-of-Thought (GCoT). It also presents the ChatCoach dataset, built via a multi-agent data-generation pipeline conditioned on external medical resources, to benchmark LLMs on detection and correction of medical terminology misuse. Empirical results show that GCoT improves the structure and external-knowledge integration of coach feedback, outperforming several prompting baselines and approaching human-like guidance in some metrics, while highlighting remaining gaps relative to expert feedback. The work advances medical education with AI by providing a concrete evaluation platform and actionable prompting strategy that enables real-time coaching in clinical conversations, with potential to enhance clinician training and communication quality at scale.

Abstract

Traditional applications of natural language processing (NLP) in healthcare have predominantly focused on patient-centered services, enhancing patient interactions and care delivery, such as through medical dialogue systems. However, the potential of NLP to benefit inexperienced doctors, particularly in areas such as communicative medical coaching, remains largely unexplored. We introduce "ChatCoach", a human-AI cooperative framework designed to assist medical learners in practicing their communication skills during patient consultations. ChatCoach differentiates itself from conventional dialogue systems by offering a simulated environment where medical learners can practice dialogues with a patient agent, while a coach agent provides immediate, structured feedback. This is facilitated by our proposed Generalized Chain-of-Thought (GCoT) approach, which fosters the generation of structured feedback and enhances the utilization of external knowledge sources. Additionally, we have developed a dataset specifically for evaluating Large Language Models (LLMs) within the ChatCoach framework on communicative medical coaching tasks. Our empirical results validate the effectiveness of ChatCoach. Our data and code are available online: https://github.com/zerowst/Chatcoach
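The interaction pattern described in the abstract (a learner talks to a patient agent while a coach agent interjects with feedback) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Turn`, `coach_agent`, `patient_agent`, `run_turn`, and `KNOWN_CORRECTIONS` are hypothetical, the agents are simple stubs rather than LLM calls, and the lookup table merely stands in for an external medical resource.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical stand-in for an external medical resource. The toy scenario
# assumes the patient has high blood pressure, so "hypotension" is a misuse.
# This table is illustrative and not taken from the ChatCoach dataset.
KNOWN_CORRECTIONS: Dict[str, str] = {"hypotension": "hypertension"}


@dataclass
class Turn:
    speaker: str  # "doctor", "coach", or "patient"
    text: str


def coach_agent(doctor_utterance: str, corrections: Dict[str, str]) -> Optional[str]:
    """Toy coach: detect a listed term and suggest its correction.

    A real coach agent would prompt an LLM; this lookup only marks where
    that call would sit in the loop.
    """
    lowered = doctor_utterance.lower()
    for wrong, right in corrections.items():
        if wrong in lowered:
            return f"Detected '{wrong}'; consider '{right}' instead."
    return None  # stay silent when nothing is flagged


def patient_agent(history: List[Turn]) -> str:
    """Toy patient: returns a canned reply; a real one would be an LLM."""
    return "I see. What should I do next?"


def run_turn(
    history: List[Turn],
    doctor_utterance: str,
    corrections: Dict[str, str] = KNOWN_CORRECTIONS,
) -> List[Turn]:
    """Append one doctor turn, optional coach feedback, then the patient reply."""
    history.append(Turn("doctor", doctor_utterance))
    feedback = coach_agent(doctor_utterance, corrections)
    if feedback is not None:
        history.append(Turn("coach", feedback))
    history.append(Turn("patient", patient_agent(history)))
    return history
```

In this sketch the coach speaks only when it flags something, which reflects one possible design choice for keeping feedback short enough for real-time use; how the actual system decides when to intervene is not specified here.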

Paper Structure (28 sections, 2 equations, 3 figures, 9 tables)

This paper contains 28 sections, 2 equations, 3 figures, 9 tables.

Introduction
Related Work
Medical NLP Applications with LLM
Medical Education with NLP
Prompting-based Method
Communicative Medical Coaching
Problem Formulation
System Overview
Generalized Chain-of-Thought (GCoT)
Constructing the ChatCoach Dataset: A Multi-Agent Approach for Generating Domain-Specific Conversational Data
Data Generation Conditioned on External Resources
Task Descriptions
Human Annotation
Dataset Overview
Experiments
...and 13 more sections

Figures (3)

Figure 1: (a) General framework of communicative medical coaching. (b) Multi-agent data generation framework using external resources.
Figure 2: Example of coach feedback generated by various approaches. Vanilla CoT fails to identify errors in medical terminology, possibly due to lacking integration with external knowledge. While thorough, Zero-shot CoT generates overly verbose feedback unsuited for real-time application. In contrast, GCoT identifies errors effectively and provides concise and well-structured feedback, demonstrating superior integration of external medical knowledge for practical real-time coaching.
Figure 3: A failed example of coach feedback from various prompting-based approaches, demonstrating the issue of excessive coaching.
