Tree Ensemble Explainability through the Hoeffding Functional Decomposition and TreeHFD Algorithm

Clément Bénard

TL;DR

This paper introduces TreeHFD, a data-driven algorithm that estimates the Hoeffding functional decomposition (HFD) of a tree ensemble from a data sample when the input distribution is unknown. It formalizes a discretized, piecewise-constant HFD for tree ensembles using Cartesian tree partitions, and proves convergence along with orthogonality, sparsity, and causal variable selection. Experiments on analytical and real datasets show that TreeHFD reconstructs the HFD accurately and that TreeSHAP is strongly connected to the HFD, with TreeHFD reported as more stable and easier to interpret than TreeSHAP. The approach offers a principled, scalable route to explaining tree ensembles in standard ML settings; its stated limitations concern deep trees and the need for access to model internals.

Abstract

Tree ensembles have demonstrated state-of-the-art predictive performance across a wide range of problems involving tabular data. Nevertheless, the black-box nature of tree ensembles is a strong limitation, especially for applications with critical decisions at stake. The Hoeffding or ANOVA functional decomposition is a powerful explainability method, as it breaks down black-box models into a unique sum of lower-dimensional functions, provided that input variables are independent. In standard learning settings, input variables are often dependent, and the Hoeffding decomposition is generalized through hierarchical orthogonality constraints. Such generalization leads to unique and sparse decompositions with well-defined main effects and interactions. However, the practical estimation of this decomposition from a data sample is still an open problem. Therefore, we introduce the TreeHFD algorithm to estimate the Hoeffding decomposition of a tree ensemble from a data sample. We show the convergence of TreeHFD, along with the main properties of orthogonality, sparsity, and causal variable selection. The high performance of TreeHFD is demonstrated through experiments on both simulated and real data, using our treehfd Python package (https://github.com/ThalesGroup/treehfd). Besides, we empirically show that the widely used TreeSHAP method, based on Shapley values, is strongly connected to the Hoeffding decomposition.
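
To make the idea of "a unique sum of lower-dimensional functions" concrete, here is a minimal sketch of the classical Hoeffding (ANOVA) decomposition in the independent-input case only. It is not the TreeHFD algorithm and does not use the treehfd package: the function `hoeffding_2d`, the grid, and the toy model `f` are all hypothetical, chosen for illustration. Inputs are assumed independent and uniform on the grid points, so every expectation reduces to a plain average.

```python
import numpy as np

# Hypothetical setup: two independent inputs, each uniform on its grid points.
x1 = np.linspace(0.0, 1.0, 50)
x2 = np.linspace(0.0, 1.0, 40)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

# Toy "black box" evaluated on the grid (an assumption, not from the paper).
f = np.sin(2 * np.pi * X1) + X2 ** 2 + X1 * X2


def hoeffding_2d(values):
    """Classical ANOVA decomposition of a 2-D grid function.

    Valid only when the two inputs are independent, which is the
    setting assumed throughout this sketch.
    """
    f0 = values.mean()                     # intercept: overall mean
    f1 = values.mean(axis=1) - f0          # main effect of x1 (averages out x2)
    f2 = values.mean(axis=0) - f0          # main effect of x2 (averages out x1)
    f12 = values - f0 - f1[:, None] - f2[None, :]  # remaining interaction
    return f0, f1, f2, f12


f0, f1, f2, f12 = hoeffding_2d(f)

# The components sum back to the original function.
recon = f0 + f1[:, None] + f2[None, :] + f12
```

Each main effect has zero mean, and the interaction term averages to zero along either axis, which is the orthogonality that makes the decomposition unique under independence. With dependent inputs these simple averages no longer yield such a decomposition, which is the case the abstract says is handled through hierarchical orthogonality constraints.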

Paper Structure

This paper contains 55 sections, 15 theorems, 90 equations, 12 figures, 7 tables.

Key Result

Theorem 1

If Assumption assumption:distrib is satisfied, and $\nu$ is a square-integrable real function defined on $[0, 1]^p$, then there exists a unique set of functions ${\{\nu^{(J)}}\}_{J \in \mathcal{P}_p}$, such that for all $J \in \mathcal{P}_p$, $I \subset J$ with $I \neq J$, ${\mathbb{E}[\nu^{(J)}(\te

Figures (12)

  • Figure 1: Example of the partition of $[0,1]^2$ by a tree $T_{\ell}$ (left side), and the associated Cartesian tree partitions ${\mathcal{A}_{\ell}^{(1)} = \{A_1^{(1)}, A_2^{(1)}, A_3^{(1)}, A_4^{(1)}\}}$, ${\mathcal{A}_{\ell}^{(2)} = \{A_1^{(2)}, A_2^{(2)}, A_3^{(2)}\}}$, and ${\mathcal{A}_{\ell}^{(1,2)}}$ (right side).
  • Figure 2: Main effects of the decompositions for $X^{(1)}$ and $X^{(2)}$. Solid lines provide the theoretical functions, with the HFD in green, int. SHAP in red, and obs. SHAP in orange. Green and red points are respectively the values provided by TreeHFD and TreeSHAP with interactions for xgboost.
  • Figure 3: For the "Housing" dataset, main effects of "Longitude" and "Latitude" in the decompositions of respectively TreeHFD in blue and TreeSHAP with interactions in red.
  • Figure 4: For the analytical case, main effects of the decompositions for $X^{(3)}$ and $X^{(6)}$. Solid lines provide the theoretical functions, with the HFD in green, int. SHAP in red, and obs. SHAP in orange. Green and red points are respectively the values provided by TreeHFD and TreeSHAP with interactions for xgboost.
  • Figure 5: For the analytical case, interaction function of $X^{(1)}$ and $X^{(2)}$ estimated by TreeHFD.
  • ...and 7 more figures

Theorems & Definitions (28)

  • Theorem 1: Hoeffding Decomposition (Stone, 1994; Hooker, 2007)
  • Definition 1: Cartesian Tree Partition
  • Theorem 2: HFD for Tree Ensembles
  • Theorem 3
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Corollary 1
  • Theorem 7
  • ...and 18 more