Training and Evaluation of Guideline-Based Medical Reasoning in LLMs

Michael Staniek, Artem Sokolov, Stefan Riezler

TL;DR

The paper tackles the challenge of trustworthy medical AI by teaching LLMs to follow verbalized medical consensus rules (exemplified by Sepsis-3) step by step. It combines deductive SOFA-based inference with inductive time-series forecasting: models are fine-tuned (LoRA) on rule-instantiation data and augmented with a time series forecasting (TSF) model whose outputs are integrated as a multimodal input. Key findings show that small fine-tuned models can exceed much larger prompt-based or text-only baselines in derivation and value correctness, while inductive forecasting remains the main bottleneck, potentially mitigated by multimodal integration. The work demonstrates automatic evaluation of inference chains against gold-standard consensus rules, advancing faithful, explainable medical reasoning and providing a clear path for extending rule-based reasoning to other consensus guidelines.

Abstract

Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.

Paper Structure

This paper contains 25 sections, 2 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Inductive and deductive inference rules in the Sepsis-3 Consensus Definition (Singer et al., 2016; Seymour et al., 2016). Deductive rules calculate extrema over time, map thresholds onto step functions for SOFA scores, and calculate total SOFA and changes over time. Inductive rules involve time series forecasting of clinical variables 24 hours into the future.
  • Figure 2: Fine-tuning data including a verbalization of Sepsis-3 inference (left column) and inference under an exception due to medical preconditions (right column). Differences are shown in bold blue font. The general prompt is shown above the horizontal line, the gold standard answer below.
  • Figure 3: Derivation correctness (dashed blue arrows) checks for the predicted inference graph whether each child node (conclusion) follows from the parent node (premise) according to the consensus rule. Value correctness (dotted red arrows) maps each node in the predicted inference graph to its corresponding node in the ground truth graph, consisting of real-world clinical measurements for first and second 24 hours, and deterministic calculation of SOFA and SEPSIS on these values.
  • Figure 4: Time Series Forecasting using a dense encoder and iterative multistep decoder architecture.
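
The deductive rules of Figure 1 and the two evaluation criteria of Figure 3 can be illustrated with a small sketch. It is not the authors' implementation. It assumes only two SOFA components (platelets and bilirubin) with cut-offs taken from the standard SOFA table, and it assumes that suspected infection is already established, so that a rise of at least 2 SOFA points between the first and second 24 hours flags sepsis. All function names and the patient values are hypothetical.

```python
from bisect import bisect_right

# Assumed cut-offs from the standard SOFA table (illustration only).
PLATELET_CUTS = [20, 50, 100, 150]      # x10^3/uL, lower is worse
BILIRUBIN_CUTS = [1.2, 2.0, 6.0, 12.0]  # mg/dL, higher is worse


def platelet_score(min_platelets):
    """Step function: <20 -> 4, <50 -> 3, <100 -> 2, <150 -> 1, else 0."""
    return 4 - bisect_right(PLATELET_CUTS, min_platelets)


def bilirubin_score(max_bilirubin):
    """Step function: one point per cut-off reached, from 0 to 4."""
    return bisect_right(BILIRUBIN_CUTS, max_bilirubin)


def window_sofa(window):
    """Partial SOFA for one 24-hour window: extrema first, then step functions."""
    return (platelet_score(min(window["platelets"]))
            + bilirubin_score(max(window["bilirubin"])))


def sepsis_flag(first, second):
    """Assumed rule: SOFA increase >= 2, suspected infection taken as given."""
    return window_sofa(second) - window_sofa(first) >= 2


def derivation_correct(predicted_window, predicted_sofa):
    """Does the stated conclusion follow from the model's own premises?"""
    return window_sofa(predicted_window) == predicted_sofa


def value_correct(real_window, predicted_sofa):
    """Does the stated conclusion match the score on real measurements?"""
    return window_sofa(real_window) == predicted_sofa


# Hypothetical patient: observed first 24 hours, real second 24 hours,
# and a model forecast for the second 24 hours with its claimed SOFA.
first = {"platelets": [180, 160], "bilirubin": [0.8, 1.0]}
second_real = {"platelets": [90, 120], "bilirubin": [2.5, 1.9]}
second_pred = {"platelets": [130], "bilirubin": [1.5]}
claimed_sofa = 2

print("derivation correct:", derivation_correct(second_pred, claimed_sofa))
print("value correct:", value_correct(second_real, claimed_sofa))
print("sepsis on real data:", sepsis_flag(first, second_real))
```

In this example the model applies the rule faithfully to its own forecast, so the derivation is correct, but the forecast underestimates the deterioration, so the value is wrong. This mirrors the abstract's point that forecasting, not deduction, is the bottleneck.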