To Bag is to Prune
Philippe Goulet Coulombe
TL;DR
This paper tackles the paradox of Random Forests overfitting in-sample yet performing well out-of-sample by arguing that bootstrap aggregation and feature perturbation induce implicit pruning of a latent true tree through randomized greedy optimization. It formalizes how deep, greedily grown trees averaged across bootstrap samples can mimic an optimally pruned model, effectively implementing early stopping without cross-validation. Extending the idea to Boosting and MARS, it introduces Booging and MARSquake and demonstrates that these ensembles, even when allowed to overfit in-sample, match or exceed tuned counterparts on simulated and real data, aided by data augmentation. The findings suggest a general principle: properly randomized greedy learners can autonomously stop learning at the right point, with practical implications for model design and hyperparameter tuning in complex nonlinear models.
Abstract
It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF blatantly overfits in-sample without any apparent consequence out-of-sample. Standard arguments, like the classic bias-variance trade-off or double descent, cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a latent "true" tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, novel variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles perform similarly to their tuned counterparts, or better.
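The paradox described above can be reproduced in a toy setting. The following is a minimal sketch, not the paper's experimental setup: it assumes a single feature, a sine-shaped signal with Gaussian noise, 200 training points, and 50 bootstrap replications, all chosen purely for illustration. It also uses bagging alone, without the feature subsampling that a full RF adds. The helper names (`grow_tree`, `predict`, `mse`) are hypothetical. Each tree is grown greedily until every leaf is pure, so a single tree interpolates the training data, and the script compares its in-sample and out-of-sample errors with those of the bagged ensemble.

```python
import math
import random


def grow_tree(points):
    """Greedily grow a regression tree on sorted (x, y) pairs until no split is left."""
    n = len(points)
    ys = [y for _, y in points]
    total = sum(ys)
    best_score, best_i = None, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += ys[i - 1]
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing the sum of squared errors.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best_score is None or score > best_score:
            best_score, best_i = score, i
    if best_i is None:
        return total / n  # leaf: predict the mean response
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2
    return (threshold, grow_tree(points[:best_i]), grow_tree(points[best_i:]))


def predict(tree, x):
    """Walk down the tree; internal nodes are tuples, leaves are floats."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def mse(trees, data):
    """Mean squared error of the average prediction of `trees` on `data`."""
    total = 0.0
    for x, y in data:
        y_hat = sum(predict(t, x) for t in trees) / len(trees)
        total += (y - y_hat) ** 2
    return total / len(data)


rng = random.Random(0)


def simulate(n):
    """Assumed data-generating process: y = sin(2*pi*x) + N(0, 0.5^2) noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.5)))
    return data


train, test = simulate(200), simulate(500)

# One fully grown tree versus 50 fully grown trees, each fit to a bootstrap sample.
single = [grow_tree(sorted(train))]
bagged = [grow_tree(sorted(rng.choice(train) for _ in train)) for _ in range(50)]

single_train, single_test = mse(single, train), mse(single, test)
bag_train, bag_test = mse(bagged, train), mse(bagged, test)

print(f"single tree  train MSE: {single_train:.3f}  test MSE: {single_test:.3f}")
print(f"bagged trees train MSE: {bag_train:.3f}  test MSE: {bag_test:.3f}")
```

In this sketch the bagged ensemble still fits the training data far more closely than it fits new data, since each training point appears in roughly 63% of the bootstrap samples and is reproduced exactly by the trees that saw it. Its out-of-sample error should nonetheless be lower than that of the single fully grown tree, which is the pattern the abstract describes. No depth or leaf-size parameter was tuned.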
