Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers

Alicia Curth, Alan Jeffares, Mihaela van der Schaar

TL;DR

The paper reframes tree ensembles as adaptive, self-regularizing smoothers and introduces a quantitative measure of effective smoothing, $p^{0}_{\hat{\boldsymbol{s}}}$, to compare predictions at training versus unseen inputs. By empirically and theoretically analyzing both interpolation and randomness-based regularization, the authors reconcile Wyner et al.'s spiked-smooth hypothesis with Mentch et al.'s randomness-as-regularization view, showing that the observed smoothing differences arise from ensemble-induced variability in smoothing across inputs. The results demonstrate three distinct mechanisms (reducing sampling variance, lowering model variability, and enriching the representation space) that collectively explain why forests generalize better than single trees, particularly under train-test dissimilarity and varying signal-to-noise ratios. Practically, the work provides guidance on how to tune randomness (e.g., bootstrapping, feature subsampling) to optimize test-time smoothing and generalization in real-world tabular data settings.

Abstract

Despite their remarkable effectiveness and broad application, the drivers of success underlying ensembles of trees are still not fully understood. In this paper, we highlight how interpreting tree ensembles as adaptive and self-regularizing smoothers can provide new intuition and deeper insight to this topic. We use this perspective to show that, when studied as smoothers, randomized tree ensembles not only make predictions that are quantifiably more smooth than the predictions of the individual trees they consist of, but also further regulate their smoothness at test-time based on the dissimilarity between testing and training inputs. First, we use this insight to revisit, refine and reconcile two recent explanations of forest success by providing a new way of quantifying the conjectured behaviors of tree ensembles objectively by measuring the effective degree of smoothing they imply. Then, we move beyond existing explanations for the mechanisms by which tree ensembles improve upon individual trees and challenge the popular wisdom that the superior performance of forests should be understood as a consequence of variance reduction alone. We argue that the current high-level dichotomy into bias- and variance-reduction prevalent in statistics is insufficient to understand tree ensembles -- because the prevailing definition of bias does not capture differences in the expressivity of the hypothesis classes formed by trees and forests. Instead, we show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled. In particular, we demonstrate that the smoothing effect of ensembling can reduce variance in predictions due to noise in outcome generation, reduce variability in the quality of the learned function given fixed input data and reduce potential bias in learnable functions by enriching the available hypothesis space.

Paper Structure (43 sections, 14 equations, 27 figures)

Figures (27)

  • Figure 1: Illustrating when, where and why averaging ensembles can make smoother predictions than individual regression trees. Consider a stylized example with two input-output pairs $(x_{train, i}, y_{train,i})$ to which we fit decision trees of full depth to learn to make predictions $\hat{y}$. This problem is underspecified -- the splits with decision boundaries displayed in panels (A) and (B) lead to the same impurity decrease. Which split will be realized is thus a random choice. Both trees make the same prediction for training inputs $x_{train, i}$. For the previously unseen test-input $x_{test}$, however, the two trees will issue different predictions. An ensemble of these trees, as displayed in Panel (C), will therefore (i) make the same prediction as the individual trees close to training examples but (ii) will make a smoother prediction than each individual tree in regions where the decision boundary is underdetermined. That is, while individual full-depth trees always act as 1-Nearest Neighbor estimators (with learned distance metric), an ensemble of 2 full-depth trees will act as a 1-Nearest Neighbor estimator around training examples, but can self-regularize to act as a 2-Nearest Neighbor estimator in underdetermined regions at test-time.
  • Figure 2: Understanding the performance of interpolating tree ensembles. Effective parameters (left), effective number of nearest neighbors (middle) and generalization error (right) by number of trees for forests of full-depth trees trained without bootstrapping.
  • Figure 3: Generalization error by $m$ for ensembles of 50 interpolating trees.
  • Figure 4: The smoothing effect of ensembling for trees of different depth. Training error (A), train-time effective parameters (EPs, B), test-time effective parameters (C), the difference (gap) between train-time and test-time effective parameters (D) and generalization error (E) by number of trees for forests of trees of different depths trained without bootstrap and with $m=\frac{1}{3}$.
  • Figure 5: Train- and test-time effective parameters for boosted ensembles using gradient boosting of trees of different depths with learning rate $\gamma=.05$.
  • ...and 22 more figures
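
The stylized example in the Figure 1 caption can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the two-dimensional layout of the training points, the split threshold of 0.5, the test point, and the helper names `tree_weights`, `ensemble_weights` and `effective_neighbors` are all hypothetical. It also assumes that the effective number of neighbors can be summarized as the inverse squared norm of the smoother weight vector, which is a common convention but is not stated in the text above.

```python
import numpy as np

# Two training pairs (assumed layout): either feature separates them equally well.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0.0, 1.0])


def tree_weights(x, feature, threshold=0.5):
    """Smoother weights of a full-depth tree with a single split.

    The tree predicts the mean outcome of the training points that share
    a leaf with x, so each such point gets weight 1 / (leaf size).
    """
    same_leaf = (X_train[:, feature] <= threshold) == (x[feature] <= threshold)
    return same_leaf / same_leaf.sum()


def ensemble_weights(x):
    """Average the weights of tree A (split on feature 0) and tree B (feature 1)."""
    return 0.5 * (tree_weights(x, 0) + tree_weights(x, 1))


def effective_neighbors(weights):
    """Assumed summary of smoothing: 1 / ||weights||^2 (equals k for uniform k-NN)."""
    return 1.0 / np.sum(weights ** 2)


x_test = np.array([0.2, 0.8])  # lies in the region where the two trees disagree

for name, x in [("train point", X_train[0]), ("test point", x_test)]:
    w = ensemble_weights(x)
    print(name, "weights:", w,
          "prediction:", w @ y_train,
          "effective neighbors:", effective_neighbors(w))
```

At the training point both trees put all weight on the same example, so the ensemble still behaves like a 1-Nearest Neighbor estimator. At the test point the trees disagree, the averaged weights are split evenly across both training examples, and the ensemble behaves like a 2-Nearest Neighbor estimator, matching the behavior described in the caption.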