Filtering instances and rejecting predictions to obtain reliable models in healthcare

Maria Gabriela Valeriano, David Kohan Marzagão, Alfredo Montelongo, Carlos Roberto Veiga Kiffer, Natan Katz, Ana Carolina Lorena

TL;DR

The paper tackles reliability in healthcare ML by pairing data-centric refinement of the training set with a safety-oriented rejection mechanism. It uses Instance Hardness (IH), a consensus-based metric, to prune hard training examples, and couples this with a calibrated confidence-based reject option at inference. Both thresholds are chosen with a cost function that balances accuracy, confidence, and coverage. Key contributions include a detailed framework and a heuristic for threshold selection, an evaluation on three real-world Brazilian health datasets, and baseline comparisons that use influence values for filtering and uncertainty for rejection. The results show that IH filtering combined with confidence-based rejection improves predictive reliability while preserving a large share of predictions, which supports practical deployment in safety-critical settings. The framework is adaptable and data-centric, and it emphasizes transparent, risk-aware decision-making for real-world healthcare applications.

Abstract

Machine Learning (ML) models are widely used in high-stakes domains such as healthcare, where the reliability of predictions is critical. However, these models often fail to account for uncertainty, providing predictions even with low confidence. This work proposes a novel two-step data-centric approach to enhance the performance of ML models by improving data quality and filtering low-confidence predictions. The first step involves leveraging Instance Hardness (IH) to filter problematic instances during training, thereby refining the dataset. The second step introduces a confidence-based rejection mechanism during inference, ensuring that only reliable predictions are retained. We evaluate our approach using three real-world healthcare datasets, demonstrating its effectiveness at improving model reliability while balancing predictive performance and rejection rate. Additionally, we use alternative criteria - influence values for filtering and uncertainty for rejection - as baselines to evaluate the efficiency of the proposed method. The results demonstrate that integrating IH filtering with confidence-based rejection effectively enhances model performance while preserving a large proportion of instances. This approach provides a practical method for deploying ML systems in safety-critical applications.
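
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common definition of IH as one minus the average probability that a pool of classifiers assigns to an instance's true class, assumes that instances with IH above the filtering threshold are the ones discarded, and uses hypothetical function names (`instance_hardness`, `filter_by_hardness`, `predict_with_reject`) and made-up toy probabilities.

```python
from typing import List, Optional, Sequence


def instance_hardness(true_class_probs: Sequence[Sequence[float]]) -> List[float]:
    """IH per instance: 1 minus the mean probability given to the true class.

    true_class_probs[i][j] is the probability that classifier j assigns to
    the correct label of instance i (assumed to be out-of-fold estimates).
    """
    return [1.0 - sum(row) / len(row) for row in true_class_probs]


def filter_by_hardness(ih: Sequence[float], t_f: float) -> List[int]:
    """Indices of training instances to keep (assumption: keep IH <= t_f)."""
    return [i for i, h in enumerate(ih) if h <= t_f]


def predict_with_reject(
    probs: Sequence[Sequence[float]], t_r: float
) -> List[Optional[int]]:
    """Predicted class per row, or None when the top probability is below t_r."""
    decisions: List[Optional[int]] = []
    for row in probs:
        label = max(range(len(row)), key=lambda k: row[k])
        decisions.append(label if row[label] >= t_r else None)
    return decisions


# Toy inputs (illustrative only, not taken from the paper).
pool_probs = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.5, 0.7]]
ih_values = instance_hardness(pool_probs)
kept = filter_by_hardness(ih_values, t_f=0.5)  # the second instance is dropped

class_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
decisions = predict_with_reject(class_probs, t_r=0.7)
accepted = sum(d is not None for d in decisions) / len(decisions)
```

In this sketch a rejected prediction is represented by `None`, so a downstream system can route those cases to a human reviewer while reporting the acceptance rate alongside the predictive metric.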

Paper Structure

This paper contains 31 sections, 8 equations, 11 figures, 10 tables, 2 algorithms.

Figures (11)

  • Figure 1: Average macro-F1 and rate of accepted instances computed across five train-validation splits. The acceptance rate is relative to the size of the validation set. Models were trained using XGBoost with varying filtering threshold values ($T_f$) and evaluated across a range of rejection thresholds ($T_r$). Experiments were conducted using the severity_hsl dataset. As $T_r$ increases, predictive performance improves for models trained with lower $T_f$ values. Higher $T_f$ values result in more stable performance metrics across different $T_r$ levels.
  • Figure 2: Cost values per filtering threshold applied to the severity_hsl dataset. Instances were filtered based on IH values, and confidence-based rejection was applied. For each filtering threshold we selected the rejection threshold that minimizes the cost. Values are averages over five different train-validation splits. The first (left) plot presents the initial set of filtering thresholds tested, whereas the second plot (right) includes additional values identified using a heuristic approach.
  • Figure 3: Framework used in this study. The main dataset is divided into training and validation sets. IH values are estimated from the training set, and instances are filtered using different thresholds ($T_f$). The filtered data is used to train a classifier, which is evaluated on the validation set at various rejection thresholds ($T_r$) based on classifier confidence. This process is repeated 5 times, and the average results are used to determine suitable threshold values ($T_f$ and $T_r$) by minimizing a cost function using a heuristic. Finally, the model is trained on the full dataset and evaluated on a separate test set.
  • Figure : (a) Confidence methods.
  • Figure : (a) Grid search.
  • ...and 6 more figures
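
The threshold selection described in the captions of Figures 2 and 3 (for each filtering threshold $T_f$, pick the rejection threshold $T_r$ that minimizes the cost, then compare across $T_f$) can be sketched as follows. The cost function here is a hypothetical stand-in that penalises both low macro-F1 and low acceptance rate; the paper's actual cost function and its heuristic for proposing additional $T_f$ values are not reproduced. The metric values are made up for illustration and the names `cost`, `best_rejection_per_filter`, and `select_thresholds` are assumptions.

```python
from typing import Dict, Tuple

# (macro-F1, acceptance rate), assumed to be averaged over validation splits
Metrics = Tuple[float, float]


def cost(macro_f1: float, acceptance: float, weight: float = 0.5) -> float:
    """Hypothetical cost (not the paper's formula): errors plus weighted rejections."""
    return (1.0 - macro_f1) + weight * (1.0 - acceptance)


def best_rejection_per_filter(
    results: Dict[Tuple[float, float], Metrics], weight: float = 0.5
) -> Dict[float, Tuple[float, float]]:
    """For each T_f, return the (T_r, cost) pair with the lowest cost."""
    best: Dict[float, Tuple[float, float]] = {}
    for (t_f, t_r), (f1, acc) in results.items():
        c = cost(f1, acc, weight)
        if t_f not in best or c < best[t_f][1]:
            best[t_f] = (t_r, c)
    return best


def select_thresholds(
    results: Dict[Tuple[float, float], Metrics], weight: float = 0.5
) -> Tuple[float, float]:
    """Return the (T_f, T_r) pair with the lowest cost overall."""
    per_filter = best_rejection_per_filter(results, weight)
    t_f = min(per_filter, key=lambda k: per_filter[k][1])
    return t_f, per_filter[t_f][0]


# Made-up validation metrics keyed by (T_f, T_r); illustrative only.
toy_results = {
    (0.4, 0.5): (0.70, 1.00),
    (0.4, 0.8): (0.85, 0.80),
    (0.6, 0.5): (0.72, 1.00),
    (0.6, 0.8): (0.82, 0.90),
}
per_filter = best_rejection_per_filter(toy_results)
chosen_t_f, chosen_t_r = select_thresholds(toy_results)
```

The `weight` parameter controls how strongly rejections are penalised relative to errors; a larger value favours thresholds that keep more predictions.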