Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication

Philip Chung; Christine T Fong; Andrew M Walters; Nima Aghaeepour; Meliha Yetisgen; Vikas N O'Reilly-Shah

TL;DR

This study evaluates GPT-4 Turbo for perioperative risk prediction using procedure descriptions and preoperative EHR notes across eight outcomes. It finds that the LLM achieves notable performance on classification tasks such as ASA-PS, ICU admission, and hospital mortality, especially with few-shot and chain-of-thought prompting, while duration predictions remain poorly calibrated. Summaries of patient notes can help scale few-shot prompting and improve interpretability, though note length effects are task-dependent. The authors highlight the potential clinical utility of LLMs as decision-support tools that provide natural-language explanations, while acknowledging limitations and the need for domain-specific models and prospective validation to augment or replace existing prediction approaches.

Abstract

We investigate whether general-domain large language models such as GPT-4 Turbo can perform risk stratification and predict post-operative outcome measures using a description of the procedure and a patient's clinical notes derived from the electronic health record. We examine predictive performance on 8 different tasks: prediction of ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, PACU Phase 1 duration, hospital duration, and ICU duration. Few-shot and chain-of-thought prompting improve predictive performance for several of the tasks. We achieve F1 scores of 0.50 for ASA Physical Status Classification, 0.81 for ICU admission, and 0.86 for hospital mortality. Performance on duration prediction tasks was universally poor across all prompt strategies. Current generation large language models can assist clinicians in perioperative risk stratification on classification tasks and produce high-quality natural language summaries and explanations.

Paper Structure (52 sections, 22 figures)

Introduction
Methods
Study Cohort and Dataset Definition
Experimental Approach
Results
Datasets
Effect of Prompt Strategy on Perioperative Risk Prediction Tasks
Effect of Summary Representation of Notes
Effect of Note Length on Perioperative Risk Prediction Tasks
Numerical Prediction Tasks
Discussion
Conclusion
Supplemental Tables
Supplemental Table 1: Note Type & Author Provider Type
Supplemental Table 2: Experiment Costs
...and 37 more sections

Figures (22)

Figure 1: Overview of the experimental apparatus. GPT-4 Turbo is used as the large language model (LLM) in all steps. Each prompt to the LLM is unique to the task, prompt strategy, and query case for which an answer and explanation are generated. The zero-shot prompt strategy is run with both the original clinical notes and a summary of the clinical notes. Few-shot prompts use in-context examples derived from the few-shot dataset. Each in-context example consists of a question, procedure description, summary of patient notes, and answer. Summaries are generated using the LLM. The few-shot chain-of-thought (CoT) prompt strategy requires a CoT rationale for each in-context example that links the question to the answer; this rationale is also generated using the LLM. Answers provided by the LLM are compared against the ground truth label derived from electronic health record (EHR) data, and either F1 score or mean absolute error (MAE) is computed, depending on whether the outcome variable for the task is categorical/binary or integer.
Figure 2: Performance for the 8 perioperative prediction tasks. The X-axis shows the different prompt strategies: the first six do not use chain-of-thought reasoning and the second six do. “Notes” indicates that original clinical notes were inserted into the prompt, whereas “Summary” indicates that clinical notes were first summarized using GPT-4 Turbo and then the summary was inserted into the prompt. All in-context examples for few-shot prompts used note summaries. The Y-axis is F1 score for classification tasks, where a higher score is better, and mean absolute error (MAE) for regression tasks, where a lower error is better. The baseline for classification tasks represents the score achieved by random guessing. The baseline for regression tasks represents the MAE achieved by a regressor that always predicts the mean value in the dataset. The clinical notes are stratified into short, medium, and long length groups, which represent the $\frac{1}{3}$ shortest, $\frac{1}{3}$ middle, and $\frac{1}{3}$ longest notes in the dataset, and performance is shown for each stratum.
Figure 3: Prompt and LLM output for the zero-shot chain-of-thought Q&A from notes summary prompt strategy. Note summaries are generated from raw clinical notes by the LLM prior to insertion into the prompt. The LLM output shows that the LLM understands the definition of ASA Physical Status Classification (ASA-PS) and provides a valid rationale for the ASA-PS class to which the patient should be assigned. All prompt strategies using this patient and procedure example are depicted in Supplemental Figure 1. While the content of this example is derived from a real patient and case from the electronic health record, all PHI and PII are removed, with names obfuscated and dates and times shifted.
Figure 4: Scatter plot of predicted and actual post-anesthesia care unit (PACU) Phase 1 recovery durations across all 12 prompt strategies. Without few-shot and CoT prompting, predictions are heavily quantized to specific values and exhibit a ceiling effect where the LLM rarely predicts beyond 180 minutes. The progressive addition of few-shot and CoT prompting removes this effect, but predictive performance remains poor.
Figure 5: Flow diagram showing how the task-specific datasets were constructed from electronic health record (EHR) data. Certain outcomes, such as ICU admission, unplanned admission, and hospital mortality, occur rarely, so the datasets are constructed to balance the task labels. If a patient has multiple procedure cases, only a single case for that patient is included in the final dataset. “Clinical Notes” refers to up to the last 10 clinician-written notes filed prior to each procedure, excluding notes directly associated with the procedure itself. Due to the rarity of ICU admission, the datasets for the ICU Duration and ICU Admission tasks are identical.
...and 17 more figures
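
The evaluation quantities named in the captions for Figures 1 and 2 (F1 score, MAE, the mean-predicting regression baseline, and the split of notes into length thirds) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the F1 shown is the plain binary form (the averaging used for the multi-class ASA-PS task is not stated here), and note length is assumed to be any numeric measure such as a token count.

```python
from statistics import mean


def f1_binary(y_true, y_pred, positive=1):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_baseline_mae(y_true):
    """MAE of a regressor that always predicts the dataset mean."""
    mu = mean(y_true)
    return mae(y_true, [mu] * len(y_true))


def length_tertiles(lengths):
    """Label each note 'short', 'medium', or 'long' by rank of its length.

    Assumption: groups are formed by rank, so each holds about one third
    of the notes; ties are broken by original order.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    n = len(order)
    names = ("short", "medium", "long")
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = names[min(rank * 3 // n, 2)]
    return labels


if __name__ == "__main__":
    # Toy values for illustration only; not data from the study.
    print(f1_binary([1, 1, 0, 0], [1, 0, 1, 0]))
    print(mae([10, 20, 30], [12, 18, 30]))
    print(mean_baseline_mae([10, 20, 30]))
    print(length_tertiles([100, 5, 50, 70, 20, 300]))
```

A duration-prediction strategy is only informative if its MAE falls below `mean_baseline_mae` on the same cases, which is the comparison the regression baseline in Figure 2 is described as providing.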
