Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann

TL;DR

This work shows that general-purpose foundation models struggle with real-world hospital-operations tasks, and that a domain-specialized decoder model (Lang1) trained on a large corpus of EHR notes and web data, followed by supervised finetuning on real operational tasks, can outperform much larger generalists. Using the ReMedE benchmark built from 668,331 EHR notes, Lang1-1B achieves superior AUROC across five operational tasks and demonstrates cross-task and cross-institution transfer to MIMIC III. The findings indicate that predictive capabilities for hospital operations require explicit supervised finetuning, with pretraining providing data-efficiency benefits but not replacing task-specific supervision. Training in-house specialized models is financially feasible and strategically advantageous for health systems, reducing reliance on external APIs and enabling adaptation to evolving clinical practices, while real-world evaluation remains essential beyond proxy benchmarks.

Abstract

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
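The results above are reported as AUROC, so it may help to recall how that metric is read. The sketch below is a plain-Python illustration and not the authors' evaluation code; the function name `auroc` and the toy labels and scores are assumptions made for this example. It computes AUROC as the fraction of (positive, negative) pairs in which the positive case receives the higher score, counting ties as half.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    labels: iterable of 0/1 ground-truth labels (hypothetical toy data here).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not data from the paper).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))                 # 0.75
print(auroc(toy_labels, [-s for s in toy_scores]))   # 0.25, ranking inverted
```

Under this reading, 50% AUROC corresponds to random ranking, so a zero-shot score at the low end of the reported 36.6%-71.7% range means the model ranks cases worse than chance on that task.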

Paper Structure

This paper contains 69 sections, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Overview. (a) We mix unlabeled EHR notes and web texts as our pretraining corpus. (b) We pretrain using next-token prediction. (c) Instruction finetuning in multiple-choice format enables cross-task transfer. (d) We compare Lang1 to off-the-shelf generalist models. (e) To derive design principles, we run ablations on data mix, model scale, pretraining trajectories, data scale, eval task type, eval hospital, and eval time.
  • Figure 2: Finetuned small specialists outperform strong generalists on ReMedE.
  • Figure 3: Zero-shot clinical classification performance does not increase over the course of pretraining, unlike reading comprehension. Error bars depict the 95% confidence interval.
  • Figure 4: Finetuning is more token-efficient than pretraining for performance gains, but in-domain pretraining enables sample-efficient finetuning. This advantage is also associated with lower perplexity on in-domain tasks.
  • Figure 5: Lang1 is able to transfer to unseen tasks and to a different health system.
  • ...and 11 more figures