Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation

Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun

TL;DR

This work demonstrates that an open-source LLM (LLaMA-2-13B) can achieve expert-level outpatient note generation from patient–doctor dialogues through a domain- and task-specific adaptation pipeline combining continued pretraining, supervised fine-tuning, and reinforcement learning from AI and human feedback. The authors introduce DistillDirect to ensure on-policy learning, using Gemini 1.0 Pro as the teacher, and validate the approach with a blinded physician reader study, where LLaMA-Clinic achieves high acceptability and real-world readiness comparable to physician-authored notes, particularly in the challenging Assessment and Plan section. They also emphasize creating a standardized “best practice” note format to reduce variability and improve model guidance, and they publicly release synthetic dialogue-note data to support future research. The study highlights practical benefits of local, privately hosted models, including cost reductions and privacy advantages, while outlining the challenges of data heterogeneity, system stability during training, and the need for careful workflow integration with physician oversight. Overall, the results suggest that open-source clinical note generation is feasible at near-physician quality with scalable data and careful alignment through on-policy RLHF, offering a path toward private, cost-effective, domain-tuned clinical NLP tools.

Abstract

Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pretraining, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (92.8%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic matched physician-authored notes in real-world readiness score. We highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a "best practice" note format, rather than relying on LLMs to determine this for clinical practice.

Paper Structure (67 sections, 3 equations, 9 figures, 13 tables)

Figures (9)

  • Figure 1: Overview of Study Design. We conducted a comprehensive domain- and task-specific adaptation process for the LLaMA-2-13B model. This process included continued pretraining, supervised fine-tuning, and reinforcement learning from AI and human feedback. Finally, we evaluated our model's outputs against those created by physicians and Gemini Pro through a blinded expert evaluation. We used Gemini 1.0 Pro as the teacher model in this study.
  • Figure 1: Training Loss Spikes for 13B Models during Continued Pretraining. A. Loss curve for the base model with bf16 training on vanilla LLaMA-recipes without a LR scheduler. B. Loss curve for the base model with mixed precision training without a LR scheduler. C. Loss curve for the chat model with a LR scheduler and bf16 training. All experiments were performed on the Discharge-long dataset. The X-axis represents processed training tokens, and the Y-axis represents training loss. Original loss curves are shown without smoothing.
  • Figure 2: Comparison of Distilled DPO, DistillDirect, and RLHF. A. Distilled DPO: The preference dataset is generated and labeled by external LLMs rather than by the target policy, resulting in off-policy and offline training. B. DistillDirect: A response is generated from the target policy for each prompt, thereby making training on-policy. Additionally, another response is generated from an external LLM serving as the teacher model. C. RLHF: All responses are generated by the target policy, and preference labeling is completed by humans. Consequently, the training process is on-policy and online. In our study, we utilized DistillDirect for on-policy learning of RLAIF, followed by further online and on-policy learning using RLHF.
  • Figure 2: Example Training Set Accuracy and Reward Margin during DPO with a LR of 2e-5. Examples are taken from 13B-chat-short_R3; all other runs show similar training curves, with high accuracy and reward margin early on at this LR.
  • Figure 3: Training Loss Curve from Continued Pretraining. A. Training with the Discharge-long dataset (1.2 billion tokens). B. Training with the Discharge-short dataset (0.2 billion tokens). The X-axis represents processed training tokens, and the Y-axis represents training loss. The figures illustrate results from mixed precision training with a cosine learning rate scheduler. All experiments were trained for 1 epoch on their respective training datasets. The loss curve in the solid line was smoothed with an exponential moving average and a window size of 250 steps. The original loss values are shown as the faded background.
  • ...and 4 more figures
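
The DistillDirect setup described in the Figure 2 caption (one response from the target policy and one from an external teacher for each prompt) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `PreferencePair`, `build_distilldirect_pairs`, `policy_generate`, and `teacher_generate` are hypothetical, and treating the teacher response as the preferred one is an assumption that the caption itself does not state.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # assumption: the teacher's response is treated as preferred
    rejected: str  # assumption: the current policy's response is treated as rejected


def build_distilldirect_pairs(
    prompts: List[str],
    policy_generate: Callable[[str], str],
    teacher_generate: Callable[[str], str],
) -> List[PreferencePair]:
    """Build one preference pair per prompt.

    The policy response is sampled from the model being trained, which is
    what makes the data on-policy. The teacher response comes from an
    external LLM (hypothetical callable here; no real API is invoked).
    """
    pairs = []
    for prompt in prompts:
        policy_out = policy_generate(prompt)    # sampled from the target policy
        teacher_out = teacher_generate(prompt)  # sampled from the teacher model
        pairs.append(
            PreferencePair(prompt=prompt, chosen=teacher_out, rejected=policy_out)
        )
    return pairs


if __name__ == "__main__":
    # Stand-in generators so the sketch runs without any model or network access.
    demo = build_distilldirect_pairs(
        ["Patient reports a three-day cough."],
        policy_generate=lambda p: "policy note for: " + p,
        teacher_generate=lambda p: "teacher note for: " + p,
    )
    print(demo[0])
```

Pairs in this prompt/chosen/rejected shape are the usual input to a DPO-style trainer. Regenerating the policy responses from the latest checkpoint before each training round is what would keep the data on-policy in such a setup.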