Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation
Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun
TL;DR
This work demonstrates that an open-source LLM (LLaMA-2-13B) can achieve expert-level outpatient note generation from patient–doctor dialogues through a domain- and task-specific adaptation pipeline that combines continued pretraining, supervised fine-tuning, and reinforcement learning from AI and human feedback. The authors introduce DistillDirect, which uses Gemini 1.0 Pro as the teacher model, to keep the reinforcement learning on-policy. They validate the approach with a blinded physician reader study, in which LLaMA-Clinic achieves high acceptability and real-world readiness comparable to physician-authored notes, particularly in the challenging “Assessment and Plan” section. They also emphasize defining a standardized “best practice” note format up front to reduce variability and improve model guidance, and they publicly release synthetic dialogue-note data to support future research. The study highlights practical benefits of local, privately hosted models, including cost reductions and privacy advantages, while outlining the challenges of data heterogeneity, system stability during training, and the need for careful workflow integration with physician oversight. Overall, the results suggest that open-source clinical note generation is feasible at near-physician quality with scalable data and careful alignment through on-policy reinforcement learning from AI and human feedback, offering a path toward private, cost-effective, domain-tuned clinical NLP tools.
Abstract
Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source 13-billion-parameter LLaMA-2 model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pretraining, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduce a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (92.8%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic matched physician-authored notes in real-world readiness score. We highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a "best practice" note format, rather than relying on LLMs to determine this for clinical practice.
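
The 92.8% figure above pools individual evaluations across the three criteria. As a rough illustration of that kind of pooled tally, the sketch below counts the share of ratings at or above "acceptable". The five-level scale, the function name `acceptable_share`, and the toy ratings are all assumptions made for illustration; the abstract names only the "acceptable" level and reports no per-rating data.

```python
# Assumed ordinal scale, lowest to highest. Only "acceptable" is named in
# the abstract; the other labels are hypothetical placeholders.
SCALE = ["very poor", "poor", "acceptable", "good", "very good"]
ACCEPTABLE_RANK = SCALE.index("acceptable")


def acceptable_share(ratings):
    """Return the fraction of (criterion, label) ratings at or above 'acceptable'."""
    if not ratings:
        raise ValueError("no ratings supplied")
    hits = sum(1 for _, label in ratings if SCALE.index(label) >= ACCEPTABLE_RANK)
    return hits / len(ratings)


# Hypothetical toy ratings, not data from the study.
toy_ratings = [
    ("real-world readiness", "good"),
    ("completeness", "acceptable"),
    ("accuracy", "very good"),
    ("accuracy", "poor"),
]

share = acceptable_share(toy_ratings)
print(f"{share:.1%} of evaluations rated acceptable or higher")
```

Pooling every individual evaluation this way weights each rating equally, which is one reasonable reading of "individual evaluations"; a per-criterion or per-reader breakdown would need the ratings grouped before counting.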
