Preservation of Feature Stability in Machine Learning Under Data Uncertainty for Decision Support in Critical Domains

Karol Capała; Paulina Tworek; Jose Sousa

TL;DR

Decision support in critical domains is hindered by data incompleteness, which undermines explainability and input stability. The study compares Random Forest and Gradient Boosting with a descriptive approach (PPA) that uses ROC- and quantile-based abstractions and a knowledge-graph representation; the PPA classifier computes $P(X|C_j)=\prod_{i=1}^{n}P(x_i|C_j)$ and ignores missing features. Two abstraction schemes, fixed-width and quantile-based, with the number of abstractions capped at $n_{\max}=\frac{c^2 R^2}{(c-1) z^2} N$ in the general case and $n_{\max}=\frac{4 R^2}{z^2} N$ in the binary case, are shown to improve accuracy and, in particular, robustness to data incompleteness. The results indicate that descriptive classification with these abstractions can match or exceed traditional explainable ML in critical decision-making tasks while offering stronger resilience to missing data and preserving feature significance, which supports its deployment in uncertain decision contexts.
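The two formulas in the summary above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of the conditional probabilities, and the use of `None` to mark a missing feature are all hypothetical. The symbols $c$, $R$, $z$ and $N$ are treated as plain numeric inputs because this summary does not define them.

```python
import math


def class_likelihood(sample, cond_probs):
    """Return the product of P(x_i | C_j) over the observed features only.

    sample:     dict mapping feature name -> abstraction label,
                or None when the feature is missing (assumed convention).
    cond_probs: dict mapping feature name -> {label: P(label | C_j)}
                for one class C_j (hypothetical layout).
    """
    likelihood = 1.0
    for feature, label in sample.items():
        if label is None:
            continue  # a missing feature contributes no factor to the product
        likelihood *= cond_probs[feature][label]
    return likelihood


def n_max(R, z, N, c=2):
    """Upper bound on the number of abstractions, general formula.

    Implements n_max = c^2 R^2 / ((c - 1) z^2) * N as written in the text.
    With c = 2 the factor c^2 / (c - 1) equals 4, which gives the binary form.
    """
    return (c ** 2) * (R ** 2) / ((c - 1) * z ** 2) * N


# Illustrative values only; they are not taken from the paper.
probs_for_class = {
    "age": {"low": 0.5, "high": 0.5},
    "pressure": {"low": 0.2, "high": 0.8},
}
patient = {"age": "low", "pressure": None}  # "pressure" is missing
print(class_likelihood(patient, probs_for_class))
print(n_max(R=0.1, z=1.96, N=1000))
```

Skipping a factor, rather than imputing a value, means a sample with missing entries is scored only on what was actually observed. The same sketch also shows that the binary bound is the general bound evaluated at $c=2$.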

Abstract

As Machine Learning (ML) is increasingly deployed to support decision-making in critical domains, providing decision-makers with explainable, stable, and relevant inputs becomes fundamental. Understanding how machine learning behaves under missing data, and how this affects feature variability, is paramount. This is even more relevant because machine learning approaches tend to standardise decision-making around an idealised set of features, whereas decision-making in human activities often relies on incomplete data, even in critical domains. This paper addresses this gap through a set of experiments that compare traditional machine learning methods, which look for optimal decisions, with a recently deployed machine learning method whose classification is more descriptive and mimics human decision-making, allowing for the natural integration of explainability. We found that the descriptive ML approach maintains higher classification accuracy while keeping feature selection stable as data incompleteness increases. This suggests that descriptive classification methods can be helpful in uncertain decision-making scenarios.

Paper Structure (22 sections, 15 equations, 6 figures)

Introduction
Related work
Random Forest (RF)
Boosting Trees (BT)
Previously proposed approach (PPA)
Abstractions
ROC Curves
Static binning
Quantiles
Limitation on number of abstractions
Experiments
Verification metrics
Datasets
Experiment design
Choice of abstractions
...and 7 more sections

Figures (6)

Figure 1: Balanced accuracy (BA) of the PPA classification method as a function of the different abstraction methods.
Figure 2: Balanced accuracy (BA) as a function of percentage of missing data for PPA with two best-performing abstraction methods (order quantiles 20 and 20 bins), PPA with ROC curve, Random Forest and Gradient Boosting.
Figure 3: Precision as a function of percentage of missing data for PPA with two best-performing abstraction methods (order quantiles 20 and 20 bins), PPA with ROC curve, Random Forest and Gradient Boosting.
Figure 4: Recall as a function of percentage of missing data for PPA with two best-performing abstraction methods (order quantiles 20 and 20 bins), PPA with ROC curve, Random Forest and Gradient Boosting.
Figure 5: Most important features for the classification of DIGEN39_5578 for PPA with quantiles 20 abstractions (top), Random Forest (middle) and Gradient Boosting (bottom) for different levels of missing data.
...and 1 more figure
