Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication
Philip Chung, Christine T Fong, Andrew M Walters, Nima Aghaeepour, Meliha Yetisgen, Vikas N O'Reilly-Shah
TL;DR
This study evaluates GPT-4 Turbo for perioperative risk prediction using procedure descriptions and preoperative EHR notes across eight outcomes. The LLM performs best on classification tasks, reaching F1 scores of 0.50 for ASA Physical Status (ASA-PS), 0.81 for ICU admission, and 0.86 for hospital mortality, with few-shot and chain-of-thought prompting improving several tasks. Duration predictions remain poorly calibrated. Summaries of patient notes can help scale few-shot prompting and improve interpretability, though the effect of note length is task-dependent. The authors highlight the potential clinical utility of LLMs as decision-support tools that provide natural-language explanations, while acknowledging limitations and the need for domain-specific models and prospective validation before such tools augment or replace existing prediction approaches.
Abstract
We investigate whether general-domain large language models such as GPT-4 Turbo can perform risk stratification and predict postoperative outcome measures using a description of the procedure and a patient's clinical notes derived from the electronic health record. We examine predictive performance on eight tasks: prediction of ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, PACU Phase 1 duration, hospital duration, and ICU duration. Few-shot and chain-of-thought prompting improve predictive performance for several of the tasks. We achieve F1 scores of 0.50 for ASA Physical Status Classification, 0.81 for ICU admission, and 0.86 for hospital mortality. Performance on the duration prediction tasks was universally poor across all prompt strategies. Current-generation large language models can assist clinicians in perioperative risk stratification on classification tasks and produce high-quality natural language summaries and explanations.
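As an illustration of the few-shot, chain-of-thought prompting mentioned above, the sketch below assembles a prompt from solved example cases followed by a new case. It is a minimal sketch under stated assumptions, not the authors' prompt: the `Example` structure, the `build_prompt` function, the field labels, and the instruction wording are all hypothetical, and no model API is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One solved case used as a few-shot demonstration (hypothetical structure)."""
    procedure: str   # procedure description
    notes: str       # preoperative note text, or a summary of it
    rationale: str   # worked reasoning shown to the model (chain of thought)
    answer: str      # known outcome label for this case


def build_prompt(task_question: str, examples: List[Example],
                 procedure: str, notes: str) -> str:
    """Assemble a few-shot chain-of-thought prompt as plain text.

    The layout and wording are illustrative assumptions, not taken from the paper.
    """
    parts = []
    # Few-shot part: each demonstration shows reasoning before the answer.
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Procedure: {ex.procedure}\n"
            f"Notes: {ex.notes}\n"
            f"Question: {task_question}\n"
            f"Reasoning: {ex.rationale}\n"
            f"Answer: {ex.answer}\n"
        )
    # Query part: ask the model to reason first, then commit to an answer.
    parts.append(
        "New case\n"
        f"Procedure: {procedure}\n"
        f"Notes: {notes}\n"
        f"Question: {task_question}\n"
        "Reasoning: think step by step, then give a final line "
        "starting with 'Answer:'.\n"
    )
    return "\n".join(parts)
```

Passing an empty `examples` list yields a zero-shot chain-of-thought prompt, so the same helper covers both settings. Using note summaries in the `notes` field keeps each demonstration short, which is one way more examples could fit in a fixed context window.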
