To Bag is to Prune

Philippe Goulet Coulombe

TL;DR

This paper tackles the paradox of Random Forests overfitting in-sample yet performing well out-of-sample by arguing that bootstrap aggregation and feature perturbation induce implicit pruning of a latent true tree through randomized greedy optimization. It formalizes how deep, greedily grown trees averaged across bootstrap samples can mimic an optimally pruned model, effectively implementing early stopping without cross-validation. Extending the idea to Boosting and MARS, it introduces Booging and MARSquake and demonstrates that these ensembles, even when allowed to overfit in-sample, match or exceed tuned counterparts on simulated and real data, aided by data augmentation. The findings suggest a general principle: properly randomized greedy learners can autonomously stop learning at the right point, with practical implications for model design and hyperparameter tuning in complex nonlinear models.

Abstract

It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF blatantly overfits in-sample without any apparent consequence out-of-sample. Standard arguments, like the classic bias-variance trade-off or double descent, cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a latent "true" tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, novel variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles perform similarly to their tuned counterparts -- or better.
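
The in-sample versus out-of-sample contrast described in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's experiment or its RF implementation: it assumes a hypothetical one-dimensional DGP (a sine wave plus Gaussian noise), uses only bootstrap aggregation (no feature perturbation), and relies on the fact that a fully grown tree on a single feature with midpoint splits predicts the response of the nearest training point. All function names, the seed, and the noise level are illustrative choices.

```python
import bisect
import math
import random


def fit_full_tree_1d(xs, ys):
    """Fully grown 1-D regression tree, stored as the sorted training pairs."""
    pairs = sorted(zip(xs, ys))
    return [p[0] for p in pairs], [p[1] for p in pairs]


def predict_full_tree_1d(model, x):
    """With midpoint splits and one distinct x per leaf, the tree returns
    the response of the nearest training point."""
    sx, sy = model
    i = bisect.bisect_left(sx, x)
    if i == 0:
        return sy[0]
    if i == len(sx):
        return sy[-1]
    return sy[i] if sx[i] - x < x - sx[i - 1] else sy[i - 1]


def r_squared(y, yhat):
    """Usual R^2: one minus residual sum of squares over total sum of squares."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


def simulate(n, rng, noise_sd=0.5):
    """Hypothetical DGP (an assumption, not from the paper): sine plus noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def run_demo(n=200, n_trees=100, seed=0):
    rng = random.Random(seed)
    x_tr, y_tr = simulate(n, rng)
    x_te, y_te = simulate(n, rng)

    # One fully grown tree on the whole training sample.
    single = fit_full_tree_1d(x_tr, y_tr)

    # Bagging: one fully grown tree per bootstrap sample, then average.
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_full_tree_1d([x_tr[i] for i in idx],
                                      [y_tr[i] for i in idx]))

    def bagged(x):
        return sum(predict_full_tree_1d(t, x) for t in trees) / n_trees

    return {
        "single_train": r_squared(y_tr, [predict_full_tree_1d(single, x) for x in x_tr]),
        "single_test": r_squared(y_te, [predict_full_tree_1d(single, x) for x in x_te]),
        "bagged_train": r_squared(y_tr, [bagged(x) for x in x_tr]),
        "bagged_test": r_squared(y_te, [bagged(x) for x in x_te]),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting the single tree interpolates the training data, while the bagged ensemble still fits the training sample much more closely than the test sample yet predicts better out-of-sample than the single tree. The paper's argument concerns why this happens; the sketch only reproduces the symptom.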

Paper Structure

This paper contains 23 sections, 25 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Abalone data set (http://archive.ics.uci.edu/ml/datasets/Abalone): comparing $R^2_{\text{train}}$ and $R^2_{\text{test}}$. The hyperparameters of the first four models are tuned by 5-fold CV. RF uses default parameters. NN details are in Appendix \ref{sec:nndetails}.
  • Figure 2: Model averaging/bagging different base learners with increasingly many useless features. Units are $\ln{(\text{MSE}_{\text{model}}/\text{MSE}_\text{Oracle})}$. Oracle has 10 regressors, SNR=2, and $N=100$. Each prediction is the average over 50 models and 20 bagging replicas. For OLS, bagging is bypassed since it provides the same expectation as using the full sample once. The 50 models are constructed as follows: for each model, I generate $x$ new useless regressors (features made of noise that are not entering the DGP) and add them to the relevant ones, then run estimation. "Greedy OLS" is glmboost in R, setting the learning rate at 1.
  • Figure 3: Dashed lines are the true $R^2$, i.e., the best attainable $R^2$ out-of-sample given the variance of the true error. DGP is Friedman 1 MARS, simulated using mlbench.friedman1 (https://www.rdocumentation.org/packages/mlbench/versions/2.1-3/topics/mlbench.friedman1). Precisely, inputs are 10 independent variables uniformly distributed on the interval $[0,1]$, and only 5 of these 10 enter the DGP, so that $y_i = 10 \sin(\pi x_{1,i} x_{2,i}) + 20 (x_{3,i} - 0.5)^2 + 10 x_{4,i}+ 5 x_{5,i} + \epsilon_i$, where $\epsilon_i$ is normal noise (so it is optimal for both algorithms to minimize the squared loss). The $x$-axis is an index of complexity/depth. For RF, it is a decreasing minimal node size from 200 to 1 in 30 steps, and for NN, an increasing number of layers from 1 to 30. The NN is 50 neurons wide and RF's $\texttt{mtry}=1/3$. Other NN details are in Appendix \ref{sec:nndetails}.
  • Figure 4: This figure plots the average hold-out sample $R^2$ between the prediction and the true conditional mean for 30 simulations. The level of noise is calibrated so the SNR is 4. Column facets are DGPs and row facets are base learners. The $x$-axis is an index of depth of the greedy model. For CART, it is a decreasing minimal node size $\in 1.4^{\{16,..,2\}}$; for Boosting, an increasing number of steps $\in 1.5^{\{4,..,18\}}$; and for MARS, an increasing number of included terms $\in 1.4^{\{2,..,16\}}$. Both training and test sets have $N=400$.
  • Figure 5: A Subset of Empirical Prediction Results. The performance metric is $R^2_{\text{test}}$. Darker green bars mean the performance differential between the tuned version and the three others is statistically significant at the 5% level using t-tests (and Diebold-Mariano tests for time series data). Light green means the difference is not significant at the prescribed level. To enhance visibility in certain cases, $R^2_{\text{test}}$'s below -0.25 are constrained to 0.25.
  • ...and 2 more figures
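
The Friedman 1 DGP spelled out in the Figure 3 caption, and the "best attainable $R^2$" that its dashed lines refer to, can be sketched in a few lines. This is a plain-Python stand-in for the R function cited in the caption, not the paper's code. The noise standard deviation is left as a free parameter because the caption does not state it, and the function names, seeds, and Monte Carlo sample size are hypothetical.

```python
import math
import random


def friedman1_mean(x):
    """Conditional mean from the Figure 3 caption; only x[0..4] enter it."""
    return (10.0 * math.sin(math.pi * x[0] * x[1])
            + 20.0 * (x[2] - 0.5) ** 2
            + 10.0 * x[3]
            + 5.0 * x[4])


def simulate_friedman1(n, noise_sd=1.0, seed=0):
    """Draw n observations: 10 independent U[0,1] inputs plus normal noise.

    noise_sd is an assumption; the caption does not give its value.
    """
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(10)] for _ in range(n)]
    y = [friedman1_mean(row) + rng.gauss(0.0, noise_sd) for row in X]
    return X, y


def best_attainable_r2(noise_sd, n_mc=20000, seed=0):
    """Monte Carlo estimate of Var(f) / (Var(f) + noise_sd^2).

    This is the R^2 of a predictor that knows the true conditional mean,
    so no model can beat it out-of-sample on average.
    """
    rng = random.Random(seed)
    f = [friedman1_mean([rng.random() for _ in range(10)]) for _ in range(n_mc)]
    mean_f = sum(f) / n_mc
    var_f = sum((v - mean_f) ** 2 for v in f) / n_mc
    return var_f / (var_f + noise_sd ** 2)


if __name__ == "__main__":
    X, y = simulate_friedman1(5)
    print(len(X), len(X[0]), len(y))
    print(round(best_attainable_r2(1.0), 3))
```

Raising `noise_sd` lowers the ceiling returned by `best_attainable_r2`, which is how a dashed reference line of this kind moves with the variance of the true error.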