The Foundational Capabilities of Large Language Models in Predicting Postoperative Risks Using Clinical Notes

Charles Alba, Bing Xue, Joanna Abraham, Thomas Kannampallil, Chenyang Lu

TL;DR

This study investigates whether preoperative clinical notes carry predictive signal for six postoperative risks and whether large language models can leverage that information. The authors compare clinically oriented pretrained LLMs with traditional word embeddings and apply progressively stronger fine-tuning strategies (self-supervised, label-informed, and a multi-task foundation approach). Performance improves at each step, with the unified foundation model raising AUROC by up to 3.6% and AUPRC by up to 2.6% relative to self-supervised fine-tuning. The work also shows that the benefits generalize beyond a single center, including replication on MIMIC-III, and provides qualitative safety analyses to support clinical applicability. Overall, the findings support the foundational capabilities of LLMs in perioperative risk prediction from notes and highlight practical pathways for deploying a single, multi-task model in perioperative care, while noting limitations and areas for future validation.

Abstract

Clinical notes recorded during a patient's perioperative journey hold immense informational value. Advances in large language models (LLMs) offer opportunities to make use of this information. Using 84,875 preoperative notes and their associated surgical cases from 2018 to 2021, we examine the performance of LLMs in predicting six postoperative risks using various fine-tuning strategies. Pretrained LLMs outperformed traditional word embeddings by an absolute AUROC of 38.3% and AUPRC of 33.2%. Self-supervised fine-tuning further improved performance by 3.2% for AUROC and 1.5% for AUPRC. Incorporating labels into training further increased AUROC by 1.8% and AUPRC by 2%. The highest performance was achieved with a unified foundation model, with improvements of 3.6% for AUROC and 2.6% for AUPRC compared to self-supervision, highlighting the foundational capabilities of LLMs in predicting postoperative risks, which could prove beneficial when deployed for perioperative care.
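
As a rough illustration of how the two reported metrics and an "absolute" improvement can be read, the sketch below computes AUROC and AUPRC (approximated here by average precision) on a toy example. The labels, scores, and helper names are hypothetical and chosen only for illustration; this is not the authors' evaluation code, and the AUPRC helper assumes no tied scores.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


def absolute_gain_points(baseline, improved):
    """Absolute difference between two metric values, in percentage points."""
    return 100.0 * (improved - baseline)


# Hypothetical toy data: 1 = complication occurred, 0 = it did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]

toy_auroc = auroc(toy_labels, toy_scores)              # 0.75
toy_auprc = average_precision(toy_labels, toy_scores)  # about 0.833
gain_over_chance = absolute_gain_points(0.5, toy_auroc)  # 25.0 points
```

Read this way, an absolute AUROC gain of 38.3% means the metric itself rose by 38.3 percentage points, not that it rose by 38.3% of its previous value.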

Paper Structure

This paper contains 25 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: An illustration of the architectures for the fine-tuning strategies examined in our study, covering the results reported in Sections \ref{sec:pretrained_vs_word} to \ref{sec:foundational}. Fig. 1a (top) illustrates how using the pretrained model alone differs from self-supervised fine-tuning, in which clinical texts are provided to the pretrained LLM to refine the model weights with respect to its objective loss functions. Fig. 1b (bottom) illustrates two separate fine-tuning strategies: semi-supervised fine-tuning, which creates a model fine-tuned under the supervision of a specific outcome; and foundation fine-tuning, which creates a foundation model fine-tuned through a multi-task learning (MTL) objective using all available postoperative labels in the dataset.
  • Figure 2: Comparison of predictive performance across the models and their respective tuning strategies. The bar graph shows the means across 5-fold cross-validation, with error bars representing the corresponding standard errors. Fine-tuning with the models' self-supervised pretraining objectives improved prediction performance relative to the pretrained models alone, and incorporating labels boosted performance further. The models performed best with the foundation fine-tuning strategy, in which the model was fine-tuned with a multi-task learning objective across all outcomes. Precise numerical metrics are reported in the supplementary material.
  • Figure 3: Comparison of different machine learning classifiers against our default XGBoost predictor applied to our textual representations ($\Delta = \text{model}_{i,j} - \text{XGBoost}_i$), including the use of the feed-forward auxiliary layer directly from our foundation model. No single classifier dominated the others across all outcomes and metrics. Surprisingly, the logistic regression classifier performed slightly better than the others, demonstrating that well-tuned language models can generate contextual representations precise enough to suit a simple classifier. Precise numerical metrics are reported in the supplementary material.
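
The summary statistics behind Figures 2 and 3 can be sketched as follows: a mean and standard error over cross-validation folds, and a difference between a classifier and the XGBoost baseline. This is a minimal sketch under stated assumptions. The function names and fold scores are hypothetical, the standard error is assumed to be the sample standard deviation divided by the square root of the number of folds, and the difference is assumed to be taken between fold means; the source does not spell out either convention.

```python
import math
import statistics


def mean_and_standard_error(fold_scores):
    """Mean and standard error of a metric across cross-validation folds.

    Assumption: standard error = sample standard deviation / sqrt(n folds).
    """
    mean = statistics.mean(fold_scores)
    std_err = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, std_err


def delta_vs_baseline(model_fold_scores, baseline_fold_scores):
    """Difference in mean metric between a classifier and the baseline.

    Assumption: the difference is computed between fold means.
    """
    return statistics.mean(model_fold_scores) - statistics.mean(baseline_fold_scores)


# Hypothetical AUROC values from 5 folds, for illustration only.
xgboost_folds = [0.80, 0.82, 0.81, 0.79, 0.83]
logistic_folds = [0.81, 0.83, 0.82, 0.80, 0.84]

xgb_mean, xgb_se = mean_and_standard_error(xgboost_folds)
delta = delta_vs_baseline(logistic_folds, xgboost_folds)
print(f"XGBoost AUROC: {xgb_mean:.3f} +/- {xgb_se:.3f}")
print(f"Delta (logistic regression - XGBoost): {delta:+.3f}")
```

A positive difference means the alternative classifier outperformed XGBoost on that outcome and metric; a negative one means it underperformed.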