Trustworthy and Practical AI for Healthcare: A Guided Deferral System with Large Language Models
Joshua Strong, Qianhui Men, Alison Noble
TL;DR
This paper tackles the challenge of trustworthy AI in healthcare by proposing a guided deferral system that combines open-source LLMs with human decision-makers in a Human-AI Collaboration (HAIC) framework. It integrates three prediction sources (verbalised text, hidden-state representations, and their combination) to classify clinical findings, and defers uncertain cases to clinicians with intelligent guidance. A novel Imbalanced Expected Calibration Error (ECE_Imb) addresses the measurement of calibration on imbalanced healthcare data, and the system is implemented with efficient instruction-tuning of small open-source LLMs, enabling deployment on modest hardware. A pilot study and extensive experiments on OSCLMRIC and MIMIC-500 show that guided deferral improves human decision-making, yields strong calibration and deferral performance, and remains competitive with proprietary state-of-the-art models while preserving data privacy and openness for broader adoption.
Abstract
Large language models (LLMs) are a valuable technology for many healthcare applications. However, their tendency to hallucinate and the current reliance on proprietary systems pose challenges in settings that involve critical decision-making and strict data privacy regulations, such as healthcare, where trust in such systems is paramount. By combining the strengths of humans and AI while offsetting their weaknesses, the field of Human-AI Collaboration (HAIC) offers one way to tackle these challenges and thereby improve trust. This paper presents a novel HAIC guided deferral system that simultaneously parses medical reports for disorder classification and defers uncertain predictions, with intelligent guidance, to humans. We develop a methodology for building efficient, effective, open-source LLMs for this purpose, suited to real-world deployment in healthcare. We conduct a pilot study that showcases the effectiveness of the proposed system in practice. Additionally, we highlight drawbacks of standard calibration metrics in the imbalanced data scenarios commonly found in healthcare, and suggest a simple yet effective solution: the Imbalanced Expected Calibration Error.
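To make the calibration concern concrete, the following is a minimal sketch of the standard equal-width binned Expected Calibration Error, together with a generic confidence-threshold deferral rule. It is an illustration only: it is not the Imbalanced Expected Calibration Error proposed here (whose definition is not given in this section), and the threshold rule is a simplification rather than the guided deferral mechanism of the paper. The function names, the toy data, and the 0.7 threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Standard binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        counts[b] += 1
        conf_sums[b] += c
        hits[b] += int(ok)
    ece = 0.0
    for k in range(n_bins):
        if counts[k]:
            accuracy = hits[k] / counts[k]
            mean_conf = conf_sums[k] / counts[k]
            ece += counts[k] / n * abs(accuracy - mean_conf)
    return ece


def split_by_confidence(
    confidences: Sequence[float], threshold: float
) -> Tuple[List[int], List[int]]:
    """Hypothetical deferral rule: indices at or above threshold are kept, the rest deferred."""
    kept = [i for i, c in enumerate(confidences) if c >= threshold]
    deferred = [i for i, c in enumerate(confidences) if c < threshold]
    return kept, deferred


# Toy imbalanced data (invented for illustration): 90 majority-class cases and
# 10 minority-class cases, all predicted with confidence 0.9.
majority_conf, majority_ok = [0.9] * 90, [True] * 81 + [False] * 9
minority_conf, minority_ok = [0.9] * 10, [True] * 3 + [False] * 7

overall = expected_calibration_error(
    majority_conf + minority_conf, majority_ok + minority_ok
)
minority_only = expected_calibration_error(minority_conf, minority_ok)
print(f"overall ECE ~ {overall:.2f}, minority-class ECE ~ {minority_only:.2f}")

kept, deferred = split_by_confidence([0.95, 0.55, 0.80], threshold=0.7)
print("kept:", kept, "deferred:", deferred)
```

In this toy setting the pooled ECE is about 0.06 while the minority class alone is about 0.60, because the well-calibrated majority class dominates the pooled average. This is one way an aggregate calibration score can look acceptable on imbalanced data while the rare class is poorly calibrated.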
