Application of the representative measure approach to assess the reliability of decision trees in dealing with unseen vehicle collision data

Javier Perera-Lago; Víctor Toscano-Durán; Eduardo Paluzo-Hidalgo; Sara Narteni; Matteo Rucco

TL;DR

The paper examines whether a topology-based $\varepsilon$-representativeness measure can quantify dataset similarity well enough to keep decision trees reliable on unseen vehicle-collision data. It provides a theoretical guarantee that, for a $\gamma$-balanced $\varepsilon$-representative dataset with $\varepsilon<M=\min_{i\in I}\mu_i$, a binary decision tree preserves accuracy when trained on the representative subset. Empirically, it shows that the $\varepsilon$-representativeness level correlates with the similarity of feature-importance ordering for both binary DTs and XGBoost on synthetic and real collision data, with Spearman correlations around 0.51 and 0.673 respectively across multiple subsets. These results support using topology-based representativeness to gauge model explanations and reliability in tabular data, and they suggest directions for future theoretical guarantees on feature ordering and decision-rule comparisons.
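The Spearman correlation mentioned above can be illustrated with a minimal sketch. It assumes that each subset yields one $\varepsilon$ value and one score measuring how far its feature-importance ordering is from the reference ordering; the function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical values: one epsilon and one ordering-dissimilarity score per subset.
epsilons = [0.05, 0.10, 0.20, 0.40]
ordering_distance = [0, 1, 1, 3]
rho = spearman_rho(epsilons, ordering_distance)
```

A positive `rho` in this setup would mean that less representative subsets (larger $\varepsilon$) tend to produce feature orderings that differ more from the reference.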

Abstract

Machine learning algorithms are fundamental components of novel data-informed artificial intelligence (AI) architectures. In this domain, representative datasets are a cornerstone in shaping the trajectory of AI development. Representative datasets are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate the reliability of the $\varepsilon$-representativeness method to assess the dataset similarity from a theoretical perspective for decision trees. We decided to focus on the family of decision trees because it includes a wide variety of models known to be explainable. Thus, in this paper, we provide a result guaranteeing that if two datasets are related by $\varepsilon$-representativeness, i.e., both of them have points closer than $\varepsilon$, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that $\varepsilon$-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGBoost, a machine-learning component widely adopted for dealing with tabular data.
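To make the notion of $\varepsilon$-representativeness concrete, here is a minimal sketch. It assumes the following reading of the definition, which is not spelled out in this summary: a labeled dataset $\tilde{X}$ is $\varepsilon$-representative of $X$ if every point of $X$ has a point of $\tilde{X}$ with the same label at Euclidean distance at most $\varepsilon$. The function name and the toy data are hypothetical.

```python
import math

def representativeness_epsilon(X, labels, X_rep, labels_rep):
    """Smallest epsilon for which (X_rep, labels_rep) is epsilon-representative
    of (X, labels), under the assumed same-label, Euclidean-distance definition.
    Returns math.inf if some label of X does not occur in X_rep."""
    worst = 0.0
    for x, lab in zip(X, labels):
        # Distance from x to its nearest same-label point in the candidate subset.
        nearest = min(
            (math.dist(x, r) for r, lab_r in zip(X_rep, labels_rep) if lab_r == lab),
            default=math.inf,
        )
        worst = max(worst, nearest)
    return worst

# Toy data (illustrative only): three labeled points and a two-point subset.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
labels = [0, 0, 1]
X_rep = [(0.0, 0.0), (5.0, 4.0)]
labels_rep = [0, 1]
eps = representativeness_epsilon(X, labels, X_rep, labels_rep)
```

Under this assumed definition, a smaller returned value means the subset covers the original dataset more tightly.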

Paper Structure (3 sections, 1 theorem, 3 equations)

Introduction
Classification with decision trees and XGBoost
Representativeness and decision trees

Key Result

Theorem

Let $T \in \mathcal{T}$ be a binary DT, $(X,\lambda_X)$ a dataset and $(\Tilde{X},\lambda_{\Tilde{X}})$ a $\gamma$-balanced $\varepsilon$-representative dataset of $(X,\lambda_X)$. If $\varepsilon < M = \min_{i \in I} \mu_i$, then

Theorems & Definitions (1)

Theorem
