Tree Ensemble Explainability through the Hoeffding Functional Decomposition and TreeHFD Algorithm
Clément Bénard
TL;DR
This paper introduces TreeHFD, a data-driven algorithm that estimates the Hoeffding functional decomposition (HFD) of a tree ensemble from a data sample when the input distribution is unknown. It formalizes a discretized, piecewise-constant HFD for tree ensembles using Cartesian partitions, and proves convergence along with key properties: orthogonality, sparsity, and causal variable selection. Experiments on analytical and real datasets show that TreeHFD reconstructs the HFD accurately, and they reveal a strong connection between TreeSHAP and the HFD, with TreeHFD reported to be more stable and interpretable than TreeSHAP. The approach offers a principled, scalable path to explainability for tree ensembles in standard ML settings; its stated limitations concern deep trees and the need for access to model internals.
Abstract
Tree ensembles have demonstrated state-of-the-art predictive performance across a wide range of problems involving tabular data. Nevertheless, the black-box nature of tree ensembles is a strong limitation, especially for applications with critical decisions at stake. The Hoeffding or ANOVA functional decomposition is a powerful explainability method, as it breaks down black-box models into a unique sum of lower-dimensional functions, provided that input variables are independent. In standard learning settings, input variables are often dependent, and the Hoeffding decomposition is generalized through hierarchical orthogonality constraints. This generalization leads to unique and sparse decompositions with well-defined main effects and interactions. However, the practical estimation of this decomposition from a data sample is still an open problem. Therefore, we introduce the TreeHFD algorithm to estimate the Hoeffding decomposition of a tree ensemble from a data sample. We show the convergence of TreeHFD, along with the main properties of orthogonality, sparsity, and causal variable selection. The high performance of TreeHFD is demonstrated through experiments on both simulated and real data, using our treehfd Python package (https://github.com/ThalesGroup/treehfd). In addition, we empirically show that the widely used TreeSHAP method, based on Shapley values, is strongly connected to the Hoeffding decomposition.
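
To make the idea of a "unique sum of lower-dimensional functions" concrete, the following is a minimal sketch of the classical Hoeffding (ANOVA) decomposition in the simplest setting only: two independent inputs, each uniform on [0, 1]. It is not the TreeHFD algorithm and does not use the treehfd package; the toy function, the grid size, and the names `hoeffding_2d` and `toy` are assumptions chosen for illustration. The intercept is the mean of the function, each main effect is a conditional mean minus the intercept, and the interaction is what remains.

```python
import numpy as np


def hoeffding_2d(f, n=200):
    """Classical Hoeffding decomposition of f(x1, x2) on a midpoint grid.

    Assumes independent inputs, each uniform on [0, 1]; averaging over the
    grid stands in for the expectation under that product distribution.
    """
    g = (np.arange(n) + 0.5) / n                 # midpoints of n equal cells
    x1, x2 = np.meshgrid(g, g, indexing="ij")    # axis 0 is x1, axis 1 is x2
    y = f(x1, x2)
    f0 = y.mean()                                # intercept: E[f(X)]
    f1 = y.mean(axis=1) - f0                     # main effect: E[f | X1] - f0
    f2 = y.mean(axis=0) - f0                     # main effect: E[f | X2] - f0
    f12 = y - f0 - f1[:, None] - f2[None, :]     # interaction: the remainder
    return g, f0, f1, f2, f12


def toy(x1, x2):
    # Hypothetical test function with one interaction term.
    return x1 + 2.0 * x2 + x1 * x2


grid, f0, f1, f2, f12 = hoeffding_2d(toy)

# Each component is centered and orthogonal to the lower-order ones.
print("intercept:", f0)
print("max |E[f12 | X1]|:", np.abs(f12.mean(axis=1)).max())
print("E[f1 * f12]:", (f1[:, None] * f12).mean())
```

For this toy function the components can be checked by hand: the intercept is 1.75, the main effects are 1.5 (x1 - 0.5) and 2.5 (x2 - 0.5), and the interaction is (x1 - 0.5)(x2 - 0.5). With dependent inputs, the simple conditional means above no longer yield an orthogonal decomposition, which is the situation the hierarchical orthogonality constraints in the abstract are meant to handle.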
