A Continued Pretrained LLM Approach for Automatic Medical Note Generation
Dong Yuan, Eti Rastogi, Gautam Naik, Sree Prasanna Rajagopal, Sagar Goyal, Fen Zhao, Bharath Chintagunta, Jeff Ward
TL;DR
This work targets efficient, domain-specific medical documentation by training HEAL, a 13B LLaMA2-based LLM, through continued pretraining and instruction tuning on mixed medical data to produce physician-ready notes. Trained on ~14.89B tokens with an 8K context length, HEAL achieves 78.4% accuracy on PubMedQA, surpassing GPT-4 and PMC-LLaMA, and matches GPT-4 in medical-note generation while identifying more correct medical concepts. The results suggest that a compact, properly specialized model can outperform larger baselines on clinical information extraction and note completeness, pointing to a cost-effective path for healthcare transcription tools. The work underscores the value of targeted data curation and continued training for domain-specific LLMs and motivates further scaling and refinement in medical AI applications.
Abstract
LLMs are revolutionizing NLP tasks. However, the most advanced LLMs, such as GPT-4, are often prohibitively expensive to use in specialized fields. We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and evaluated on automated scribing. Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA on PubMedQA, with an accuracy of 78.4%. It also achieves parity with GPT-4 in generating medical notes. Remarkably, HEAL identifies more correct medical concepts than GPT-4 and Med-PaLM 2, and it exceeds the performance of human scribes and other comparable models in correctness and completeness.
