A feature-stable and explainable machine learning framework for trustworthy decision-making under incomplete clinical data

Justyna Andrys-Olek; Paulina Tworek; Luca Gherardini; Mark W. Ruddock; Mary Jo Kurt; Peter Fitzgerald; Jose Sousa

TL;DR

This paper presents CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework explicitly designed for small, heterogeneous, and incomplete clinical datasets.

Abstract

Machine learning models are increasingly applied to biomedical data, yet their adoption in high-stakes domains remains limited by poor robustness, limited interpretability, and instability of learned features under realistic data perturbations, such as missingness. In particular, models that achieve high predictive performance may still fail to inspire trust if their key features fluctuate when data completeness changes, undermining reproducibility and downstream decision-making. Here, we present CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework explicitly designed to address these challenges in small, heterogeneous, and incomplete clinical datasets. CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades. Using a real-world haematuria cohort comprising 568 patients evaluated for bladder cancer, we benchmark CACTUS against widely used machine learning approaches, including random forests and gradient boosting methods, under controlled levels of randomly introduced missing data. We demonstrate that CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases, including in sex-stratified analyses. Our results show that feature stability provides information complementary to conventional performance metrics and is essential for assessing the trustworthiness of machine learning models applied to biomedical data. By explicitly quantifying robustness to missing data and prioritising interpretable, stable features, CACTUS offers a generalisable framework for trustworthy data-driven decision support.
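The stability analysis described in the abstract can be illustrated with a minimal sketch. It assumes missingness is introduced uniformly at random over individual cells and that each model yields a dictionary of feature-importance scores; the function names (mask_at_random, top_k, top_k_overlap) and the seed parameter are hypothetical and are not taken from the paper or from the CACTUS code.

```python
import random

def mask_at_random(rows, fraction, seed=0):
    """Copy `rows` and set round(fraction * n_cells) cells to None,
    chosen uniformly at random (an assumed missingness mechanism)."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_mask = round(fraction * len(cells))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_mask):
        masked[i][j] = None
    return masked

def top_k(importances, k=10):
    """Names of the k features with the largest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]

def top_k_overlap(baseline, perturbed, k=10):
    """Percentage of the baseline top-k features that also appear in the
    top-k obtained after perturbation. Assumes a non-empty baseline."""
    base = set(top_k(baseline, k))
    pert = set(top_k(perturbed, k))
    return 100.0 * len(base & pert) / len(base)
```

In use, one would fit the same model on the complete data and on each masked copy, then compare the resulting importance dictionaries with top_k_overlap. A model whose overlap stays high as the masked fraction grows is stable in the sense the abstract describes.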

Paper Structure (21 sections, 7 figures, 2 tables)

SUMMARY
KEYWORDS
INTRODUCTION
RESULTS AND DISCUSSION
DATA AND METHODS
CONCLUSIONS
RESOURCE AVAILABILITY
ACKNOWLEDGMENTS
AUTHOR CONTRIBUTIONS
DECLARATION OF INTERESTS
DECLARATION OF GENERATIVE AI AND AI-ASSISTED TECHNOLOGIES
SUPPLEMENTAL INFORMATION INDEX

Figures (7)

Figure 1: Stability of features across ML methods. Average relative change in feature importance calculated for the 10 most important features for classification obtained by each method for: the total dataset (top), the male subjects in the dataset (middle), and the female subjects in the dataset (bottom).
Figure 2: Overlapping features. For each subset (total population, males, and females), the percentage of overlapping features among the top 10 most important features for classification as the number of missing values in the dataset increases.
Figure 3: Balanced Accuracy (BA). BA calculated for each subset (total, males, and females) as the number of missing values in the dataset increases.
Figure 4: Recall (sensitivity). Sensitivity calculated for each subset (total, males, and females) as the number of missing values in the dataset increases.
Figure 5: Heatmap for the male subset. A heatmap showing the 10 most important features for BC/non-BC case classification in the male subset, with the thresholds and significance values used to produce the ranks. Note: the thresholds shown on the graph represent the best value that separates the two classes (BC and non-BC cases) and do not correspond to thresholds set by medical institutions for diagnosis.
...and 2 more figures
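
The quantities named in the captions of Figures 1, 3, and 4 can be sketched as follows. Balanced accuracy and recall follow their standard definitions. The exact formula behind "average relative change in feature importance" is not given here, so the version below (absolute change divided by the baseline importance, averaged over the baseline top k) is an assumption, and all function names are hypothetical.

```python
def mean_relative_change(baseline, perturbed, k=10):
    """Average of |perturbed - baseline| / |baseline| over the baseline
    top-k features (assumed definition; baseline scores must be nonzero).
    A feature missing from `perturbed` is treated as importance 0."""
    top = sorted(baseline, key=baseline.get, reverse=True)[:k]
    changes = [abs(perturbed.get(f, 0.0) - baseline[f]) / abs(baseline[f])
               for f in top]
    return sum(changes) / len(changes)

def recall(y_true, y_pred, positive=1):
    """Sensitivity: true positives / (true positives + false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes this is the mean of
    sensitivity and specificity."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)
```

Balanced accuracy is a natural choice when the two classes (here BC and non-BC) are unevenly represented, because each class contributes equally regardless of its size.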
