Application of the representative measure approach to assess the reliability of decision trees in dealing with unseen vehicle collision data

Javier Perera-Lago, Víctor Toscano-Durán, Eduardo Paluzo-Hidalgo, Sara Narteni, Matteo Rucco

TL;DR

The paper studies how a topology-based $\varepsilon$-representativeness measure can quantify dataset similarity, and thereby how reliably decision trees handle unseen vehicle-collision data. It provides a theoretical guarantee that, for a $\gamma$-balanced $\varepsilon$-representative dataset with $\varepsilon<M=\min_{i\in I}\mu_i$, a binary decision tree preserves accuracy when trained on the representative subset. Empirically, it shows that the $\varepsilon$-representativeness level correlates with the similarity of feature-importance ordering for both binary DTs and XGBoost on synthetic and real collision data, with Spearman correlations around 0.51 and 0.673 respectively across multiple subsets. These results support using topology-based representativeness to gauge model explanations and reliability on tabular data, and they suggest directions for future theoretical guarantees on feature ordering and decision-rule comparisons.
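
The comparison of feature-importance orderings mentioned above can be illustrated with a small sketch. It computes the Spearman rank correlation between two importance vectors in plain Python. The function names and the importance values are hypothetical, chosen only for illustration; they are not the paper's code or data.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical importances of four features for a model trained on the
# full dataset and for a model trained on a subset (made-up numbers).
importance_full = [0.50, 0.30, 0.15, 0.05]
importance_subset = [0.45, 0.20, 0.30, 0.05]

rho = spearman(importance_full, importance_subset)
print(round(rho, 3))
```

Here the two orderings differ only by a swap of the second and third features, so the correlation is high but below 1. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of a hand-written version.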

Abstract

Machine learning algorithms are fundamental components of novel data-informed Artificial Intelligence architecture. In this domain, the imperative role of representative datasets is a cornerstone in shaping the trajectory of artificial intelligence (AI) development. Representative datasets are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate the reliability of the $\varepsilon$-representativeness method to assess dataset similarity from a theoretical perspective for decision trees. We decided to focus on the family of decision trees because it includes a wide variety of models known to be explainable. Thus, in this paper, we provide a result guaranteeing that if two datasets are related by $\varepsilon$-representativeness, i.e., both of them have points closer than $\varepsilon$, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that $\varepsilon$-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGBoost, a machine-learning component widely adopted for dealing with tabular data.
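
To make the notion of "points closer than $\varepsilon$" concrete, the following is a minimal sketch of how one might check it for a labelled subset. It assumes Euclidean distance, assumes each point must be matched by a subset point with the same label, and uses a non-strict inequality; none of these choices is stated in this summary, and the function names and toy data are hypothetical.

```python
import math

def min_epsilon(X, y, X_sub, y_sub):
    """Smallest eps such that every labelled point of X has a
    same-label point of X_sub within Euclidean distance eps.
    Returns infinity if some label of X is absent from X_sub."""
    eps = 0.0
    for point, label in zip(X, y):
        distances = [
            math.dist(point, candidate)
            for candidate, candidate_label in zip(X_sub, y_sub)
            if candidate_label == label
        ]
        if not distances:
            return math.inf
        eps = max(eps, min(distances))
    return eps

def is_representative(X, y, X_sub, y_sub, eps):
    """True if X_sub covers X at level eps (assumed non-strict inequality)."""
    return min_epsilon(X, y, X_sub, y_sub) <= eps

# Toy two-class dataset and a two-point candidate subset (made-up values).
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2)]
y = [0, 0, 1, 1]
X_sub = [(0.0, 0.0), (1.0, 1.0)]
y_sub = [0, 1]

eps_star = min_epsilon(X, y, X_sub, y_sub)
print(round(eps_star, 3))
print(is_representative(X, y, X_sub, y_sub, 0.25))
```

The brute-force loop costs one distance computation per pair of points, which is adequate for a small illustration; a spatial index would be the natural choice for larger data.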

Paper Structure

This paper contains 3 sections, 1 theorem, 3 equations.

Key Result

Theorem 1

Let $T \in \mathcal{T}$ be a binary DT, $(X,\lambda_X)$ a dataset and $(\Tilde{X},\lambda_{\Tilde{X}})$ a $\gamma$-balanced $\varepsilon$-representative dataset of $(X,\lambda_X)$. If $\varepsilon < M = \min_{i \in I} \mu_i$, then

Theorems & Definitions (1)

  • Theorem 1