Regression Trees Know Calculus

Nathan Wycoff

TL;DR

This work introduces a Tree-Based Gradient Estimator (TBGE) that extracts gradient information from regression trees with piecewise-constant leaves, enabling gradient-based interpretability and uncertainty quantification (UQ) for tree models. By defining local node gradients $\gamma_i$ and aggregating them across tree depth, it yields a gradient estimate $\tilde{\nabla} f(\mathbf{x})$ and supports global (TBAS) and local (TBIG) analyses via Monte Carlo and partition-based evaluations. The paper proves consistency of the gradient and integro-differential estimators and demonstrates practical gains in predictive performance, dimension reduction, and interpretability on real and synthetic datasets, with applications to MNIST and high-dimensional mortality data. This opens pathways for bringing gradient-based UQ and interpretability techniques from differentiable models into the non-smooth, scalable realm of regression trees.

Abstract

Regression trees have emerged as a preeminent tool for solving real-world regression problems due to their ability to deal with nonlinearities, interaction effects and sharp discontinuities. In this article, we rather study regression trees applied to well-behaved, differentiable functions, and determine the relationship between node parameters and the local gradient of the function being approximated. We find a simple estimate of the gradient which can be efficiently computed using quantities exposed by popular tree learning libraries. This allows the tools developed in the context of differentiable algorithms, like neural nets and Gaussian processes, to be deployed to tree-based models. To demonstrate this, we study measures of model sensitivity defined in terms of integrals of gradients and demonstrate how to compute them for regression trees using the proposed gradient estimates. Quantitative and qualitative numerical experiments reveal the capability of gradients estimated by regression trees to improve predictive analysis, solve tasks in uncertainty quantification, and provide interpretation of model behavior.

Paper Structure

This paper contains 28 sections, 5 theorems, 44 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.1

Let $S(N)$ denote the number of splits in the tree as a function of the sample size. Let $\frac{\tilde{\partial}^N f(\mathbf{x})}{\tilde{\partial} x_{p}}$ denote the TBGE at point $\mathbf{x}\in[0,1]^P$ with sample size $N$. Under Assumptions 0, 1, and 2 of the Appendix, we have that:

Figures (13)

  • Figure 1: Illustration of Gradient Estimates. The top left gives a target function, and the bottom right gives its gradient vector field. Shown in between are estimates of the gradient extracted from a regression tree fit to data from the function converging to the true vector field.
  • Figure 2: We can extract finite difference gradient approximations from a regression tree by comparing values of adjacent nodes in splits.
  • Figure 3: Illustration of Notation. In the same example tree as Figure 2, with depth $K=2$, examples of our notation are as follows: the indices of nodes at each depth are $\mathcal{D}_1 = \{1,2\}$ and $\mathcal{D}_2 = \{3,4,5,6\}$; the children of node $2$ are $c^2_l = 5$ and $c^2_r = 6$; conversely, the parent of node $5$ is $\rho_5 = 2$; the bounds of node $5$ are $\mathbf{l}^5=[0.5,0]$ and $\mathbf{u}^5 = [1,0.6]$; the value of intermediate node $2$ is $v_2 = 0.8$ and the value of leaf node $6$ is $v_6 = 0.6$; since the "root node" $0$ is split along the x-axis, $\sigma_0=1$, and since nodes $1$ and $2$ are split along the y-axis, $\sigma_1 = \sigma_2 = 2$; since the point $\mathbf{x}$ lies within nodes $2$ and $5$ at depths 1 and 2 respectively, we have $B^1(\mathbf{x}) = 2$ and $B^2(\mathbf{x})=5$.
  • Figure 4: Integrated Gradient for Trees. Each pair of panels gives a training example from MNIST. The second pair in the image superimposes the IG values onto the example. Redder means more strongly suggesting correct class membership.
  • Figure 5: Active Subspace Estimation in Low Dimension. Execution time (x-axis) and Subspace Estimation Error (y-axis) for the four methods, lower is better.
  • ...and 8 more figures
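
The notation in the Figure 3 caption and the finite-difference idea in the Figure 2 caption can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the paper's TBGE: the `Node` class and the functions `path`, `split_slope`, and `gradient_sketch` are hypothetical names; only $v_2 = 0.8$, $v_6 = 0.6$, and the bounds of node $5$ come from the caption, and every other number (including the query point) is made up. The slope at a split is taken here as the difference of the two child values divided by the distance between the child centers along the split dimension, and the per-dimension aggregation simply keeps the deepest split on the path, which is one simple choice and may differ from the aggregation across depth used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    lower: List[float]               # l^i: lower bounds of the node's box
    upper: List[float]               # u^i: upper bounds of the node's box
    value: float                     # v_i: value stored at the node
    split_dim: Optional[int] = None  # sigma_i, 0-based here; None for a leaf
    left: Optional[int] = None       # c^i_l
    right: Optional[int] = None      # c^i_r


# Depth-2 tree laid out as in the Figure 3 caption. Only v_2, v_6 and the
# bounds of node 5 are from the caption; all other numbers are hypothetical.
TREE: Dict[int, Node] = {
    0: Node([0.0, 0.0], [1.0, 1.0], 0.5, split_dim=0, left=1, right=2),
    1: Node([0.0, 0.0], [0.5, 1.0], 0.2, split_dim=1, left=3, right=4),
    2: Node([0.5, 0.0], [1.0, 1.0], 0.8, split_dim=1, left=5, right=6),
    3: Node([0.0, 0.0], [0.5, 0.5], 0.1),
    4: Node([0.0, 0.5], [0.5, 1.0], 0.3),
    5: Node([0.5, 0.0], [1.0, 0.6], 0.9),
    6: Node([0.5, 0.6], [1.0, 1.0], 0.6),
}


def path(tree: Dict[int, Node], x: List[float]) -> List[int]:
    """Indices B^1(x), B^2(x), ... of the nodes below the root that contain x."""
    out, i = [], 0
    while tree[i].split_dim is not None:
        node = tree[i]
        d = node.split_dim
        threshold = tree[node.left].upper[d]  # split location along dimension d
        i = node.left if x[d] <= threshold else node.right
        out.append(i)
    return out


def split_slope(tree: Dict[int, Node], i: int) -> float:
    """Finite difference across the split at node i (assumed form, see lead-in)."""
    node = tree[i]
    d = node.split_dim
    lo, hi = tree[node.left], tree[node.right]
    center_lo = 0.5 * (lo.lower[d] + lo.upper[d])
    center_hi = 0.5 * (hi.lower[d] + hi.upper[d])
    return (hi.value - lo.value) / (center_hi - center_lo)


def gradient_sketch(tree: Dict[int, Node], x: List[float]) -> List[float]:
    """Per-dimension slope from the deepest split on x's path (assumed rule)."""
    grad = [0.0] * len(x)
    split_nodes = [0] + path(tree, x)[:-1]  # root plus internal nodes on the path
    for i in split_nodes:
        grad[tree[i].split_dim] = split_slope(tree, i)  # deeper splits overwrite
    return grad


if __name__ == "__main__":
    x = [0.7, 0.3]  # hypothetical point inside nodes 2 and 5
    print(path(TREE, x))
    print(gradient_sketch(TREE, x))
```

For the point chosen here, `path` recovers the node indices $B^1(\mathbf{x}) = 2$ and $B^2(\mathbf{x}) = 5$ from the caption. Dimensions with no split on the path are left at zero in this sketch, which is a simplification rather than a statement about the estimator.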

Theorems & Definitions (11)

  • Theorem 4.1
  • Theorem 4.2
  • proof
  • proof
  • proof
  • Theorem A.1
  • proof
  • Corollary A.2
  • proof
  • Corollary A.3
  • ...and 1 more