Trustworthy and Practical AI for Healthcare: A Guided Deferral System with Large Language Models

Joshua Strong; Qianhui Men; Alison Noble

TL;DR

This paper addresses trustworthy AI in healthcare by proposing a guided deferral system that pairs open-source LLMs with human decision-makers in a Human-AI Collaboration (HAIC) framework. The system draws on three prediction sources (verbalised text, hidden-state representations, and their combination) to classify clinical findings, and it defers uncertain cases to clinicians together with guidance. A new metric, the Imbalanced Expected Calibration Error (ECE_Imb), addresses calibration on imbalanced healthcare data. The system is built by efficient instruction-tuning of small open-source LLMs, which enables deployment on modest hardware. A pilot study and experiments on OSCLMRIC and MIMIC-500 indicate that guided deferral improves human decision-making, yields strong calibration and deferral performance, and remains competitive with proprietary state-of-the-art models while preserving data privacy and openness for broader adoption.

Abstract

Large language models (LLMs) offer a valuable technology for various applications in healthcare. However, their tendency to hallucinate and the current reliance on proprietary systems pose challenges in settings that involve critical decision-making and strict data privacy regulations, such as healthcare, where trust in such systems is paramount. By combining the strengths and discounting the weaknesses of humans and AI, the field of Human-AI Collaboration (HAIC) presents one front for tackling these challenges and hence improving trust. This paper presents a novel HAIC guided deferral system that can simultaneously parse medical reports for disorder classification and defer uncertain predictions to humans with intelligent guidance. We develop a methodology for building efficient, effective and open-source LLMs for this purpose, suited to real-world deployment in healthcare. We conduct a pilot study which showcases the effectiveness of our proposed system in practice. Additionally, we highlight drawbacks of standard calibration metrics in the imbalanced data scenarios commonly found in healthcare, and suggest a simple yet effective solution: the Imbalanced Expected Calibration Error.
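For context on the calibration metrics mentioned above, the following is a minimal sketch of the standard binned Expected Calibration Error (ECE) for binary predictions. It is not the paper's Imbalanced ECE, whose definition is not given in this summary. The function name, the 0.5 decision threshold, and the choice of ten equal-width bins are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Standard binned ECE for binary predictions (illustrative sketch).

    probs  : predicted probability of the positive class, in [0, 1]
    labels : ground-truth labels, 0 or 1
    n_bins : number of equal-width confidence bins (assumed default)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Predicted class and the confidence assigned to that class.
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    # Assign each prediction to an equal-width confidence bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

Because each bin is weighted by the share of samples it contains, the bins holding most of the data contribute most to the score.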

Paper Structure (34 sections, 8 equations, 4 figures, 3 tables)

Introduction
Related Work
Deferral Systems for Healthcare.
Human-AI Collaboration with LLMs.
Selective Prediction of LLMs.
Instruction-Tuning of LLMs.
On the Calibration of LLMs.
The Role of Cognitive Science in HAIC.
Methods
Sources of Predictions
Verbalised Prediction Source.
Hidden-State Prediction Source.
Combined Prediction.
Instruction-Tuning Methodology
Instruction-Tuning Data Generation.
...and 19 more sections

Figures (4)

Figure 1: Our guided deferral system. Reports are parsed by an instruction-tuned LLM for clinical disorders. From the text output, we extract a verbalised prediction $\hat{t}$. We calculate a hidden-state prediction $\hat{\epsilon}$ from the final hidden layer of the LLM, and its combination with $\hat{t}$ through their mean $\hat{\mu}$. Uncertain predictions, determined by either $\hat{t}$, $\hat{\epsilon}$, or $\hat{\mu}$, are deferred to humans with guidance. Certain predictions are autonomously handled by the LLM. Created in BioRender. Strong, J. (2025).
Figure 2: Example guidance based on a spinal MRI report. The instruction-tuned LLM is able to infer a diagnosis with sound logic even without an explicit textual diagnosis in the report.
Figure 3: Visual comparison of the effectiveness of the calibration metrics (a) $\mathrm{ECE}$, (b) $\mathrm{ACE}$ and (c) $\mathrm{ECE}_{\mathrm{Imb}}$ of $\hat{\epsilon}_{\text{IT-13B}}$ predictions against the OSCLMRIC dataset test split.
Figure 4: Accuracy of prediction methods from pilot study. On average, guided humans (right box plot) outperform both unguided humans (left box plot) and the LLM (red dashed line) alone.
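The deferral flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes $\hat{t}$ and $\hat{\epsilon}$ are already available as probabilities in [0, 1], uses the mean $\hat{\mu}$ as the deferral signal, and introduces hypothetical thresholds `low` and `high` to define the uncertain band. The function name `guided_deferral` is likewise an assumption.

```python
import numpy as np

def guided_deferral(t_hat, eps_hat, low=0.3, high=0.7):
    """Combine two prediction sources and flag uncertain cases for deferral.

    t_hat   : verbalised predictions as probabilities (assumed in [0, 1])
    eps_hat : hidden-state predictions as probabilities (assumed in [0, 1])
    low, high : hypothetical thresholds; predictions strictly between
                them are treated as uncertain and deferred to a human
    """
    t_hat = np.asarray(t_hat, dtype=float)
    eps_hat = np.asarray(eps_hat, dtype=float)

    # Combined prediction: the mean of the two sources.
    mu_hat = (t_hat + eps_hat) / 2.0

    # Cases inside the uncertain band go to a clinician with guidance;
    # the rest are handled autonomously.
    defer = (mu_hat > low) & (mu_hat < high)
    labels = (mu_hat >= 0.5).astype(int)
    return mu_hat, labels, defer

# Example: three reports; only the middle one is deferred.
mu, labels, defer = guided_deferral([0.9, 0.7, 0.1], [0.95, 0.5, 0.05])
```

In practice the thresholds would need to be chosen on held-out data, and the caption notes that $\hat{t}$ or $\hat{\epsilon}$ alone could serve as the deferral signal in place of $\hat{\mu}$.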
