Regularisation of CART trees by summation of $p$-values

Nils Engler, Mathias Lindholm, Filip Lindskog, Taariq Nazar

TL;DR

The paper addresses the instability and inefficiency of cross-validation for growing CART regression trees by introducing a deterministic, in-sample stopping rule based on node-wise $p$-values. By formulating the split decision as a change-point problem and employing a Bonferroni-based bound, the authors provide an easily computable upper bound for the tree-wide $p$-value that allows covariates of arbitrary dimension and yields asymptotic power guarantees under meaningful alternatives. They connect this approach to classical regularisation techniques, showing how the $p$-value penalty relates to pruning and information criteria, and demonstrate practical performance through extensive simulations and real-data examples, including an auto-calibrated predictor built from a GBM. Overall, the method offers a deterministic, regularised alternative to cross-validation for stopping decisions in CART trees with strong theoretical and empirical support. The approach is implemented and made accessible, enabling reliable, non-random tree construction and facilitating downstream use in boosting, distillation, and auto-calibrated prediction tasks.

Abstract

The standard procedure to decide on the complexity of a CART regression tree is to use cross-validation with the aim of obtaining a predictor that generalises well to unseen data. The randomness in the selection of folds implies that the selected CART regression tree is not a deterministic function of the data. Moreover, the cross-validation procedure may become time consuming and result in inefficient use of training data. We propose a simple deterministic in-sample method that can be used for stopping the growing of a CART regression tree based on node-wise statistical tests. This testing procedure is derived using a connection to change point detection, where the null hypothesis corresponds to no signal. The suggested $p$-value based procedure allows us to consider covariate vectors of arbitrary dimension and allows us to bound the $p$-value of an entire tree from above. Further, we show that the test detects a not too weak signal with a high probability, given a not too small sample size. We illustrate our methodology and the asymptotic results on both simulated and real world data. Additionally, we illustrate how the $p$-value based method can be used to construct a deterministic piece-wise constant auto-calibrated predictor based on a given black-box predictor.
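The stopping idea described above can be illustrated with a minimal sketch. It assumes that a $p$-value per covariate is already available at each candidate split; the paper's actual test statistic is not reproduced here. The function names, the example numbers, and the ordering of splits are hypothetical and chosen only for illustration. The sketch combines the per-covariate $p$-values at a node with a Bonferroni (union) bound, sums the resulting node-wise bounds along a sequence of splits, and keeps the longest prefix whose cumulative sum stays below a tolerance $\delta$.

```python
from itertools import accumulate


def bonferroni_node_p_value(per_covariate_p):
    """Union-bound upper bound for a node's p-value.

    Assumption: per_covariate_p holds one valid p-value per covariate
    for the best split along that covariate (how they are computed is
    outside this sketch).
    """
    d = len(per_covariate_p)
    return min(1.0, d * min(per_covariate_p))


def splits_to_keep(node_p_values, delta=0.05):
    """Length of the longest prefix of splits whose summed p-value bound
    stays strictly below delta. The sum bounds the tree-wide p-value."""
    kept = 0
    for total in accumulate(node_p_values):
        if total >= delta:
            break
        kept += 1
    return kept


# Hypothetical example: three candidate splits, two covariates each.
per_split = [[0.001, 0.4], [0.004, 0.9], [0.03, 0.2]]
node_p = [bonferroni_node_p_value(p) for p in per_split]
kept = splits_to_keep(node_p, delta=0.05)
print(kept, "splits kept, i.e.", kept + 1, "leaves")
```

Because the rule uses only in-sample quantities and no random folds, repeated runs on the same data return the same tree size, which is the determinism the abstract emphasises.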

Paper Structure

This paper contains 15 sections, 5 theorems, 65 equations, 8 figures, 3 tables.

Key Result

Proposition 1

$\lim_{n\to\infty}\mathbb{P}_{\mathcal{A}^{(n)}}(P_{\max}^{(n)}>\varepsilon)=0$ for every $\varepsilon>0$.

Figures (8)

  • Figure 1: Blue curves: empirical cdf of $U_{\max}$ given $H_0$ computed from 10,000 realisations. Orange curves: Approximation $1 - dp_n(u)$. Left column: $n= 50$, right column: $n=$ 1,000. Top row: independent standard normal covariates, bottom row: dependent normal covariates with common pairwise correlation $\rho = 0.8$ and unit variance. The points of intersection with the dashed blue line illustrate empirical and approximate $0.95$-quantile of $U_{\max}$.
  • Figure 2: Blue curves: Fraction of correct signal detections according to $\{U_{\max}^{(n)} > u_\varepsilon \}$ for an increasing number of data points $n$ and independent standard normal covariates. Orange curve: The analogous fraction based on dependent multivariate normal covariates with common pairwise correlation $\rho = 0.8$ and unit variance. Green curve: The signal strength $|\mu_r-\mu_l| = n^{-1/5}$. The blue dashed line shows the $0.95$-level. The left and right plots correspond to $d=1$ and $d=10$ covariates, respectively.
  • Figure 3: Regression tree corresponding to \ref{eq:mu_neufeldt} with $a=1$, adopted from \cite{neufeld2022tree}. Each left leaf answers the inequality with "true".
  • Figure 4: Left plot: MSEP (blue) and MSE (orange) for each tree in the nested sequence of cost-complexity-pruned subtrees. Right plot: cumulative $p$-value for each tree in the nested sequence of cost-complexity-pruned subtrees. The $x$-axis depicts the number of leaves of the subtree considered. The dashed blue line marks our method's output tree, i.e. the largest subtree whose cumulative $p$-value lies below $\delta = 0.05$. The signal strength parameters are $a=b=1$.
  • Figure 5: Analogue of Figure 4 with $b=0.5$ instead of $b=1$.
  • ...and 3 more figures

Theorems & Definitions (14)

  • Definition 1: Null hypothesis, $H_0$
  • Definition 2: Alternative hypothesis, $H_A$
  • Proposition 1
  • Proposition 2
  • Remark 3
  • Remark 4
  • proof: Proof of Proposition 1
  • Lemma 5
  • proof
  • Lemma 6
  • ...and 4 more