High Order Reasoning for Time Critical Recommendation in Evidence-based Medicine
Manjiang Yu, Xue Li
TL;DR
The paper tackles time-critical ICU decision-making by examining high-order reasoning in large language models (LLMs) using the eICU dataset. It introduces a framework of four contexts (What-if, Why-not, So-what, and How-about) and a contrast evaluation framework to benchmark LLMs against physician decisions across multiple ICU tasks, including post-discharge outcome prediction. The study finds that GPT-4 often aligns closely with human treatment plans in What-if and transferability tasks and can predict post-ICU status with notable accuracy, while GPT-3.5 Turbo and LLaMA-2 show more variable performance. The results suggest that LLMs can support ICU education and decision-making under carefully controlled conditions but require robust validation and ethical review before any clinical deployment. Overall, the work provides a rigorous, education-focused evaluation of high-order reasoning in ICU contexts and highlights avenues for improving AI-assisted critical care through prompting strategies and learning paradigms.
Abstract
In time-critical decisions, human decision-makers can interact with AI-enabled situation-aware software to evaluate many imminent and possible scenarios, retrieve billions of facts, and estimate different outcomes based on trillions of parameters in a fraction of a second. In high-order reasoning, "what-if" questions challenge the assumptions or pre-conditions of the reasoning, "why-not" questions challenge the method applied in the reasoning, "so-what" questions challenge the purpose of the decision, and "how-about" questions challenge the applicability of the method. When these high-order reasoning questions are applied to assist human decision-making, they can help humans make time-critical decisions and avoid false-negative or false-positive errors. In this paper, we present a model of high-order reasoning that offers time-critical recommendations in evidence-based medicine for applications in the ICU. Our system uses a large language model (LLM). The experiments showed that the LLM performed best in the "What-if" scenario, achieving a similarity of 88.52% with the treatment plans of human doctors. In the "Why-not" scenario, the best-performing model opted for alternative treatment plans in 70% of the cases involving patients who died after being discharged from the ICU. In the "So-what" scenario, the best model provided a detailed analysis of the motivation and significance of treatment plans for ICU patients, with its reasoning achieving a similarity of 55.6% with the actual diagnostic information. In the "How-about" scenario, the top-performing LLM achieved a content similarity of 66.5% when transferring treatment plans to similar diseases. Meanwhile, LLMs predicted the life status of patients after their discharge from the ICU with an accuracy of 70%.
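The four question types in the abstract can be read as a small taxonomy of what each one challenges. The following is a minimal sketch of that taxonomy, assuming the questions are posed to an LLM as text prompts. The names `CHALLENGE_TARGETS` and `build_prompt` and the prompt wording are hypothetical illustrations, not the prompts or code used in the paper.

```python
# Illustrative sketch only: names and prompt wording are assumptions,
# not the authors' implementation.

# What each high-order question challenges, as described in the abstract.
CHALLENGE_TARGETS = {
    "what-if": "the assumptions or pre-conditions of the reasoning",
    "why-not": "the method applied in the reasoning",
    "so-what": "the purpose of the decision",
    "how-about": "the applicability of the method",
}


def build_prompt(question_type: str, case_summary: str) -> str:
    """Return a hypothetical prompt that frames one high-order question."""
    key = question_type.lower()
    if key not in CHALLENGE_TARGETS:
        raise ValueError(f"unknown question type: {question_type!r}")
    return (
        f"Case summary: {case_summary}\n"
        f"Pose a '{key}' question that challenges "
        f"{CHALLENGE_TARGETS[key]}, then answer it."
    )


if __name__ == "__main__":
    # Hypothetical, non-clinical placeholder text for the case summary.
    print(build_prompt("What-if", "<de-identified ICU case summary>"))
```

Keeping the taxonomy in a single mapping makes it easy to iterate over all four question types for the same case, which is one plausible way to compare model responses across scenarios.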
