Keeping Medical AI Healthy and Trustworthy: A Review of Detection and Correction Methods for System Degradation

Hao Guan; David Bates; Li Zhou

TL;DR

This review presents a comprehensive Detection-Diagnosis-Correction framework for maintaining the health and trustworthiness of medical AI. It distinguishes data drift from model drift and details data-shift, data-anomaly, and calibration issues. It surveys a broad spectrum of monitoring techniques (statistical, machine learning-based, feature-based, and out-of-distribution (OOD) detection). It also outlines root cause analysis (RCA) and correction strategies (domain adaptation, retraining, continual learning, and calibration) for both data-level and model-level faults. The paper catalogs datasets, tools, and benchmarks, and discusses challenges such as delayed labels, fairness, and multimodal monitoring. It proposes future directions, including label-free performance estimation and RCA for large language model (LLM) and vision-language systems. Overall, it provides a structured roadmap for sustaining safe, robust AI-assisted care in dynamic clinical environments.
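The TL;DR lists calibration among the issues to monitor. One common way to quantify miscalibration is expected calibration error (ECE): the sample-weighted average gap between mean confidence and accuracy across confidence bins. The sketch below is illustrative and is not taken from the paper. It assumes a binary classifier and equal-width bins over top-label confidence, and the function name `expected_calibration_error` is hypothetical. Other binning schemes and ECE variants exist.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Top-label ECE for a binary classifier (illustrative sketch).

    probs  -- predicted probability of the positive class, in [0, 1]
    labels -- true labels (0 or 1)
    n_bins -- number of equal-width confidence bins (an assumption)
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")

    # Each bin collects (confidence in predicted class, 1 if correct else 0).
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, int(pred == y)))

    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - acc)  # weight by bin size
    return ece


if __name__ == "__main__":
    # Overconfident model: 90% confident, always wrong.
    print(round(expected_calibration_error([0.9] * 4, [0] * 4), 3))  # prints 0.9
```

A monitoring pipeline could track ECE on successive deployment windows and compare it with its value at validation time; a sustained rise would suggest calibration drift. This requires ground-truth labels, which in clinical settings may arrive with delay.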

Abstract

Artificial intelligence (AI) is increasingly integrated into modern healthcare, offering powerful support for clinical decision-making. However, in real-world settings, AI systems may experience performance degradation over time due to factors such as shifting data distributions, changes in patient characteristics, evolving clinical protocols, and variations in data quality. These factors can compromise model reliability, posing safety concerns and increasing the likelihood of inaccurate predictions or adverse outcomes. This review presents a forward-looking perspective on monitoring and maintaining the "health" of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. The paper begins by reviewing common causes of performance degradation at both the data and model levels. We then summarize key techniques for detecting data and model drift, followed by an in-depth look at root cause analysis. Correction strategies are further reviewed, ranging from model retraining to test-time adaptation. Our survey spans both traditional machine learning models and state-of-the-art large language models (LLMs), offering insights into their strengths and limitations. Finally, we discuss ongoing technical challenges and propose future research directions. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.
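The abstract refers to techniques for detecting data drift. A simple statistical example, offered only as an illustration and not as the authors' method, is the two-sample Kolmogorov-Smirnov (KS) test applied to one continuous input feature. It compares a reference window (for example, validation data) with a recent deployment window. The sketch assumes independent samples and uses the standard large-sample critical value c(α)·sqrt((n+m)/(nm)), where c(α) = sqrt(-ln(α/2)/2). The names `ks_statistic` and `drift_detected` are hypothetical.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(ref: Sequence[float], cur: Sequence[float]) -> float:
    """Two-sample KS statistic D = sup_x |F_ref(x) - F_cur(x)|."""
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(ref), sorted(cur)
    d = 0.0
    # Both empirical CDFs are step functions that change only at sample
    # points, so the supremum is attained at one of the pooled points.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def drift_detected(
    ref: Sequence[float], cur: Sequence[float], alpha: float = 0.05
) -> bool:
    """Flag drift when D exceeds the large-sample critical value at level alpha."""
    n, m = len(ref), len(cur)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)  # about 1.358 for alpha = 0.05
    d_crit = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(ref, cur) > d_crit


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]       # e.g., validation-time feature values
    shifted = [0.5 + i / 100 for i in range(100)]   # deployment values shifted upward
    print(drift_detected(reference, reference))  # prints False
    print(drift_detected(reference, shifted))    # prints True
```

In practice, such a test would run per feature with a correction for multiple comparisons. A detected shift indicates data drift, but on its own it does not show that model performance has degraded. This is one reason to treat data drift and model drift separately.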

Figures (5)