A Continued Pretrained LLM Approach for Automatic Medical Note Generation

Dong Yuan, Eti Rastogi, Gautam Naik, Sree Prasanna Rajagopal, Sagar Goyal, Fen Zhao, Bharath Chintagunta, Jeff Ward

TL;DR

This work targets efficient, domain-specific medical documentation by training HEAL, a 13B LLaMA2-based LLM, through continued pretraining and instruction tuning on mixed medical data to produce physician-ready notes. Trained on ~14.89B tokens with an 8K context length, HEAL achieves 78.4% accuracy on PubMedQA, surpassing GPT-4 and PMC-LLaMA, and matches GPT-4 in medical-note generation while identifying more correct medical concepts. The results indicate that a compact, properly specialized model can outperform larger baselines on clinical information extraction and note completeness, suggesting a cost-effective path for healthcare transcription tools. The work underscores the value of targeted data curation and continued training for domain-specific LLMs and motivates further scaling and refinement in medical AI applications.

Abstract

LLMs are revolutionizing NLP tasks. However, the use of the most advanced LLMs, such as GPT-4, is often prohibitively expensive for most specialized fields. We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing. Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an accuracy of 78.4%. It also achieves parity with GPT-4 in generating medical notes. Remarkably, HEAL surpasses GPT-4 and Med-PaLM 2 in identifying more correct medical concepts and exceeds the performance of human scribes and other comparable models in correctness and completeness.

Paper Structure (13 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Pretraining validation perplexity.
  • Figure 2: Pretraining validation generation capability monitoring.