Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
Michael Staniek, Artem Sokolov, Stefan Riezler
TL;DR
The paper tackles the challenge of trustworthy medical AI by teaching LLMs to follow verbalized medical consensus rules (exemplified by Sepsis-3) in a step-by-step manner. It combines deductive SOFA-based inference with inductive time-series forecasting, trained via fine-tuning (LoRA) on rule-instantiation data and augmented with a time series forecasting (TSF) model and multimodal inputs. Key findings show that small, fine-tuned models can exceed much larger prompt-based or text-only baselines in derivation and value correctness, while inductive forecasting remains the main bottleneck, potentially mitigated by multimodal integration. The work demonstrates automatic evaluation of inference chains against gold-standard consensus rules, advancing faithful, explainable medical reasoning and providing a clear path for extending rule-based reasoning to other consensus guidelines.
Abstract
Machine learning for early prediction in medicine has recently shown breakthrough performance. However, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
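To make the two evaluation notions concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simplified reading of Sepsis-3 (suspected infection together with an increase of at least 2 points in the total SOFA score over baseline) and ignores any exceptions the paper handles. The names `Derivation`, `sepsis3_rule`, `derivation_correct`, and `value_correctness` are hypothetical, and the exact metric definitions used in the paper may differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Derivation:
    """A model's stated inference chain (hypothetical structure)."""
    baseline_sofa: Dict[str, int]   # per-organ-system SOFA subscores at baseline
    current_sofa: Dict[str, int]    # per-organ-system SOFA subscores (predicted)
    suspected_infection: bool       # premise stated by the model
    stated_increase: int            # total SOFA increase the model wrote down
    stated_sepsis: bool             # final conclusion the model wrote down


def sepsis3_rule(baseline_total: int, current_total: int, infection: bool) -> bool:
    # Simplified assumption: sepsis iff suspected infection and SOFA increase >= 2.
    return infection and (current_total - baseline_total) >= 2


def derivation_correct(d: Derivation) -> bool:
    """Check that the stated conclusion follows from the stated premises.

    This only tests internal consistency of the chain; it does not ask
    whether the premises match what was actually measured.
    """
    baseline_total = sum(d.baseline_sofa.values())
    current_total = sum(d.current_sofa.values())
    increase_ok = d.stated_increase == current_total - baseline_total
    conclusion_ok = d.stated_sepsis == sepsis3_rule(
        baseline_total, current_total, d.suspected_infection
    )
    return increase_ok and conclusion_ok


def value_correctness(predicted: Dict[str, int], measured: Dict[str, int]) -> float:
    """Fraction of measured subscores that the prediction matches exactly."""
    if not measured:
        raise ValueError("no measured values to compare against")
    hits = sum(1 for k, v in measured.items() if predicted.get(k) == v)
    return hits / len(measured)
```

The split mirrors the abstract's distinction: a chain can be deductively flawless (derivation correct) while its forecast premises are wrong (low value correctness), which is the situation the abstract identifies as the bottleneck for early prediction.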
