Regularisation of CART trees by summation of $p$-values
Nils Engler, Mathias Lindholm, Filip Lindskog, Taariq Nazar
TL;DR
The paper addresses the instability and inefficiency of using cross-validation to choose the complexity of CART regression trees by introducing a deterministic, in-sample stopping rule based on node-wise $p$-values. Formulating the split decision as a change-point problem and applying a Bonferroni-based bound, the authors obtain an easily computable upper bound for the tree-wide $p$-value that accommodates covariates of arbitrary dimension and yields asymptotic power guarantees under meaningful alternatives. They relate the $p$-value penalty to classical regularisation techniques such as pruning and information criteria, and demonstrate practical performance through simulations and real-data examples, including an auto-calibrated predictor built from a GBM. The method thus offers a deterministic, regularised alternative to cross-validation for stopping decisions in CART trees, supported by both theory and experiments. The approach is implemented and made accessible, enabling non-random tree construction and facilitating downstream use in boosting, distillation, and auto-calibrated prediction tasks.
Abstract
The standard procedure for deciding on the complexity of a CART regression tree is cross-validation, with the aim of obtaining a predictor that generalises well to unseen data. The randomness in the selection of folds implies that the selected CART regression tree is not a deterministic function of the data. Moreover, the cross-validation procedure may become time-consuming and result in inefficient use of training data. We propose a simple deterministic in-sample method, based on node-wise statistical tests, for deciding when to stop growing a CART regression tree. This testing procedure is derived using a connection to change-point detection, where the null hypothesis corresponds to no signal. The suggested $p$-value-based procedure allows us to consider covariate vectors of arbitrary dimension and to bound the $p$-value of an entire tree from above. Further, we show that the test detects a signal that is not too weak with high probability, provided the sample size is not too small. We illustrate our methodology and the asymptotic results on both simulated and real-world data. Additionally, we illustrate how the $p$-value-based method can be used to construct a deterministic piecewise constant auto-calibrated predictor based on a given black-box predictor.
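To make the stopping idea concrete, the following is a minimal sketch of growing a tree while the summed node-wise $p$-values stay below a tolerance. It is not the authors' implementation. The abstract does not specify the test statistic, so the sketch assumes a single covariate, Gaussian noise with known variance `sigma2`, and a two-sided z-test for a change in mean at each candidate split. The names `node_p_value`, `grow_tree`, `delta` and `sigma2` are hypothetical. Within a node, the smallest per-split $p$-value is multiplied by the number of candidate splits (a Bonferroni bound), and a split is accepted only while the running sum of node $p$-values does not exceed `delta`.

```python
import math


def node_p_value(x, y, sigma2=1.0):
    """Bonferroni-bounded p-value for 'no change in mean' in one node.

    Assumption (not from the paper): Gaussian noise with known variance
    sigma2 and a two-sided z-test at every candidate split point.
    Returns (p_value, threshold); threshold is None if no split is possible.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n = len(ys)
    total = sum(ys)
    best_p, best_threshold = 1.0, None
    n_candidates = 0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += ys[k - 1]
        if xs[k - 1] == xs[k]:
            continue  # cannot split between tied covariate values
        n_candidates += 1
        diff = left_sum / k - (total - left_sum) / (n - k)
        z = diff / math.sqrt(sigma2 * (1.0 / k + 1.0 / (n - k)))
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Gaussian p-value
        if p < best_p:
            best_p, best_threshold = p, 0.5 * (xs[k - 1] + xs[k])
    if best_threshold is None:
        return 1.0, None
    # Union bound over all candidate splits in this node.
    return min(1.0, n_candidates * best_p), best_threshold


def grow_tree(x, y, delta=0.05, sigma2=1.0):
    """Greedily split leaves while the summed node p-values stay <= delta.

    Returns the sorted accepted split thresholds and the summed p-values,
    which by the union bound is an upper bound for the whole tree.
    """
    leaves = [(list(x), list(y))]
    p_sum = 0.0
    thresholds = []
    while True:
        candidates = []
        for idx, (lx, ly) in enumerate(leaves):
            p, t = node_p_value(lx, ly, sigma2)
            if t is not None:
                candidates.append((p, t, idx))
        if not candidates:
            break
        p, t, idx = min(candidates)  # most significant split first
        if p_sum + p > delta:
            break  # accepting this split would exceed the tolerance
        p_sum += p
        thresholds.append(t)
        lx, ly = leaves.pop(idx)
        left = [(a, b) for a, b in zip(lx, ly) if a <= t]
        right = [(a, b) for a, b in zip(lx, ly) if a > t]
        leaves.append(([a for a, _ in left], [b for _, b in left]))
        leaves.append(([a for a, _ in right], [b for _, b in right]))
    return sorted(thresholds), p_sum
```

Calling `grow_tree(x, y, delta=0.05)` on the same data always returns the same splits, since no random folds are involved. In practice the noise variance would have to be estimated and several covariates handled, both of which this sketch leaves out.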
