Uncertainty-aware abstention in medical diagnosis based on medical texts

Artem Vazhentsev, Ivan Sviridov, Alvard Barseghyan, Gleb Kuzmin, Alexander Panchenko, Aleksandr Nesterov, Artem Shelmanov, Maxim Panov

TL;DR

This work addresses the reliability of AI-driven medical diagnosis from text by evaluating uncertainty quantification methods for selective prediction (abstention). It provides a multi-task evaluation covering mortality prediction, ICD-10 code prediction, outpatient diagnoses, and mental-health text classification, and introduces HUQ-2, a hybrid method that better balances aleatoric and epistemic uncertainty. HUQ-based approaches frequently outperform baselines on both instance-wise and label-wise abstention, with notable gains on real-world datasets such as Outpatient Visits and on multi-label medical code prediction (MCP). The study also notes limitations: erroneous predictions are not guaranteed to receive high uncertainty, and hyperparameter selection requires validation data. Overall, the findings support uncertainty-aware abstention as a viable path toward safer, more interpretable, and clinically reliable AI-assisted medical diagnosis pipelines, and point to promising directions in label-wise uncertainty and LLM integration for medical texts.

Abstract

This study addresses the critical issue of reliability in AI-assisted medical diagnosis. We focus on the selective prediction approach, which allows the diagnosis system to abstain from making a decision when it is not confident in the diagnosis. Such selective prediction (or abstention) approaches are usually based on modeling the predictive uncertainty of the machine learning models involved. This study explores uncertainty quantification in machine learning models for medical text analysis, addressing diverse tasks across multiple datasets. We focus on binary mortality prediction from textual data in MIMIC-III, multi-label medical code prediction using ICD-10 codes from MIMIC-IV, and multi-class classification with a private outpatient visits dataset. Additionally, we analyze mental health datasets targeting depression and anxiety detection, drawing on various text-based sources such as essays, social media posts, and clinical descriptions. In addition to comparing uncertainty methods, we introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks. Our results provide a detailed comparison of uncertainty quantification methods. They demonstrate the effectiveness of HUQ-2 in capturing and evaluating uncertainty, paving the way for more reliable and interpretable applications in medical text analysis.
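
To make the idea of abstention concrete, the following is a minimal sketch of selective prediction with a simple maximum-probability (MP) uncertainty score. It is not the HUQ-2 method, whose definition is not given in this summary; the function names `max_prob_uncertainty` and `accuracy_rejection_curve` and the toy probabilities are assumptions introduced purely for illustration. The sketch ranks predictions by uncertainty, rejects the most uncertain fraction, and reports accuracy on the predictions that remain, which is the quantity a rejection curve plots.

```python
from typing import List, Sequence


def max_prob_uncertainty(probs: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty score per instance: 1 minus the largest class probability."""
    return [1.0 - max(p) for p in probs]


def accuracy_rejection_curve(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    rejection_rates: Sequence[float],
) -> List[float]:
    """Accuracy on retained instances after rejecting the most uncertain ones.

    For each rejection rate r, the round(r * n) most uncertain predictions
    are treated as abstentions (e.g. deferred to a clinician), and accuracy
    is computed on the rest.
    """
    n = len(labels)
    # Predicted class = index of the largest probability.
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    correct = [int(y_hat == y) for y_hat, y in zip(preds, labels)]
    unc = max_prob_uncertainty(probs)
    # Indices sorted from most confident to least confident.
    order = sorted(range(n), key=lambda i: unc[i])

    accuracies = []
    for r in rejection_rates:
        keep = n - int(round(r * n))
        if keep <= 0:
            accuracies.append(float("nan"))  # everything was rejected
        else:
            kept = order[:keep]
            accuracies.append(sum(correct[i] for i in kept) / keep)
    return accuracies


if __name__ == "__main__":
    # Toy binary example (hypothetical numbers, not from the paper).
    demo_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
    demo_labels = [0, 1, 1, 0]
    print(accuracy_rejection_curve(demo_probs, demo_labels, [0.0, 0.25, 0.5]))
```

In this toy example, accuracy on the retained predictions rises as more uncertain instances are rejected, but only because the single error happens to rank among the less confident predictions. As the TL;DR notes, errors are not guaranteed to receive high uncertainty, so such an improvement is not assured in general.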

Paper Structure

This paper contains 40 sections, 15 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: An illustration of the verification pipeline in medicine based on uncertainty quantification. The most uncertain predictions are checked additionally by medical professionals.
  • Figure 2: Rejection curves for the selected methods for the considered tasks.
  • Figure 3: Rejection curves for the selected methods for the MIMIC medical code prediction task for general rejection methods vs label-wise MP approach. HUQ hyperparameters are fitted using the accuracy rejection metric. The HUQ and HUQ-2 methods overlap with the MP method due to the selected hyperparameters on the validation set.
  • Figure 4: Full rejection curves for the selected methods for the considered tasks.
  • Figure 5: Full rejection curves for the selected methods for the MIMIC medical code prediction task for general rejection vs label-wise approach. Due to the selected hyperparameters on the validation set, the HUQ and HUQ-2 methods overlap with the MP, MD, and DDU methods.
  • ...and 2 more figures