Extending Explainable Ensemble Trees (E2Tree) to regression contexts
Massimo Aria, Agostino Gnasso, Carmela Iorio, Marjolein Fokkema
TL;DR
This work addresses explainability for random forest (RF) regression by extending E2Tree, a post hoc, tree-like explanation framework built on a dissimilarity representation derived from the co-occurrence of observations in RF nodes. It defines a co-occurrence measure $O_{ij}$ between observations $i$ and $j$ and derives the dissimilarity $d_{ij}=1-O_{ij}$. Tree growth is controlled by NMSE-based stopping criteria, and a Mann–Whitney test is used to validate splits and prune branches, yielding a global estimate $\hat{O}_{ij}$ intended to reflect the RF structure faithfully. The approach maintains predictive accuracy while making predictor effects and interactions transparent. It is demonstrated on Iris and Auto MPG, with FMI values around $0.90$ and $0.75$ respectively, and visualized via heatmaps and path diagrams. The work sets the stage for applying E2Tree to other tree ensembles and motivates the development of fidelity metrics for explanations. Code is available in the e2Tree package.
Abstract
Ensemble methods such as random forests (RF) have transformed the landscape of supervised learning, offering highly accurate prediction through the aggregation of multiple weak learners. Despite their effectiveness, these methods often lack transparency, which impedes users' understanding of how RF models arrive at their predictions. Explainable ensemble trees (E2Tree) is a novel methodology for explaining random forests that provides a graphical representation of the relationship between response variables and predictors. A striking characteristic of E2Tree is that it accounts not only for the effects of predictor variables on the response but also for associations between the predictor variables, through the computation and use of dissimilarity measures. The E2Tree methodology was initially proposed for classification tasks. In this paper, we extend the methodology to regression contexts. To demonstrate the explanatory power of the proposed algorithm, we illustrate its use on real-world datasets.
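
To make the dissimilarity idea concrete, the following is a minimal sketch, not the e2Tree implementation. It assumes a simplified, unweighted co-occurrence: $O_{ij}$ is taken as the fraction of trees in which observations $i$ and $j$ fall in the same terminal node, and $d_{ij}=1-O_{ij}$. The measure used by E2Tree may additionally weight co-occurrences (for example by node quality), which is not reproduced here. The function names `co_occurrence` and `dissimilarity` and the toy `leaf_ids` table are hypothetical; in practice such a table could be obtained from a fitted forest, for instance via scikit-learn's `RandomForestRegressor.apply`.

```python
from typing import List


def co_occurrence(leaf_ids: List[List[int]]) -> List[List[float]]:
    """Unweighted co-occurrence matrix (simplifying assumption).

    leaf_ids[i][t] is the terminal-node id of observation i in tree t.
    O[i][j] is the share of trees in which i and j land in the same leaf.
    """
    n_obs = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    O = [[0.0] * n_obs for _ in range(n_obs)]
    for i in range(n_obs):
        for j in range(n_obs):
            shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            O[i][j] = shared / n_trees
    return O


def dissimilarity(O: List[List[float]]) -> List[List[float]]:
    """Dissimilarity d_ij = 1 - O_ij, applied elementwise."""
    return [[1.0 - o for o in row] for row in O]


# Toy example: 4 observations, 3 trees (made-up leaf ids, illustration only).
leaf_ids = [
    [0, 1, 2],
    [0, 1, 3],
    [1, 1, 3],
    [1, 2, 2],
]
O = co_occurrence(leaf_ids)
D = dissimilarity(O)
```

Under this simplification, observations that the forest routes to the same leaves in most trees get a dissimilarity near 0, and observations that never share a leaf get a dissimilarity of 1. A matrix of this kind can be passed to a clustering routine or displayed as a heatmap.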
