When do Random Forests work?

C. Revelas, O. Boldea, B. J. M. Werker

TL;DR

The paper analyzes when split randomization in random forests improves out-of-sample performance relative to bagging by decomposing the mean-squared error into squared bias, variance, and irreducible error across varied data characteristics. It shows that decorrelation from split randomness reduces variance but can inflate bias, and that the overall gain depends on the signal-to-noise ratio (SNR) and the underlying covariate structure. A normalized framework demonstrates that relative performance is invariant to rescaling of the regression function, enabling robust comparisons. Beyond the SNR, fat tails in the covariate distribution, irrelevant covariates, and covariate correlations critically shape outcomes. Notably, correlated covariates tend to reduce bias and amplify the benefits of randomization, while fat tails and irrelevant covariates can undermine the gains from randomization.
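The decorrelation argument can be made concrete with the standard formula for the variance of an average of identically distributed predictions that share a common pairwise correlation. The sketch below is only an illustration of that textbook formula, not the paper's analysis; the function name `variance_of_average` and the numbers in the usage lines are hypothetical, and the formula says nothing about the bias that randomization may add.

```python
def variance_of_average(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees identically distributed predictions.

    Assumes each prediction has variance sigma2 and every pair of
    predictions has the same correlation rho (a textbook simplification).
    """
    if n_trees < 1:
        raise ValueError("n_trees must be at least 1")
    # The rho * sigma2 term does not shrink as n_trees grows; only the
    # second term is averaged away.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


if __name__ == "__main__":
    # Hypothetical correlations: a lower rho stands in for more randomized splits.
    for rho in (0.6, 0.3):
        print(rho, variance_of_average(sigma2=1.0, rho=rho, n_trees=100))
```

With many trees the variance approaches `rho * sigma2`, which is why lowering the correlation between trees matters more than adding further trees.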

Abstract

We study the effectiveness of randomizing split-directions in random forests. Prior literature has shown that, on the one hand, randomization can reduce variance through decorrelation, and, on the other hand, randomization regularizes and works in low signal-to-noise ratio (SNR) environments. First, we bring together and revisit decorrelation and regularization by presenting a systematic analysis of out-of-sample mean-squared error (MSE) for different SNR scenarios based on commonly-used data-generating processes. We find that variance reduction tends to increase with the SNR and forests outperform bagging when the SNR is low because, in low SNR cases, variance dominates bias for both methods. Second, we show that the effectiveness of randomization is a question that goes beyond the SNR. We present a simulation study with fixed and moderate SNR, in which we examine the effectiveness of randomization for other data characteristics. In particular, we find that (i) randomization can increase bias in the presence of fat tails in the distribution of covariates; (ii) in the presence of irrelevant covariates randomization is ineffective because bias dominates variance; and (iii) when covariates are mutually correlated randomization tends to be effective because variance dominates bias. Beyond randomization, we find that, for both bagging and random forests, bias can be significantly reduced in the presence of correlated covariates. This last finding goes beyond the prevailing view that averaging mostly works by variance reduction. Given that in practice covariates are often correlated, our findings on correlated covariates could open the way for a better understanding of why random forests work well in many applications.
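A Monte Carlo estimate of squared bias and variance, in the spirit of the simulation study described above, can be sketched as follows. This is a minimal illustration under strong assumptions, not the paper's design: depth-one stumps stand in for full trees, the regression function `f_true`, the uniform covariates, the threshold grid, and all default parameter values are hypothetical. Setting `mtry` equal to the number of covariates corresponds to bagging, and a smaller `mtry` randomizes the split direction.

```python
import random
from statistics import fmean, pvariance


def f_true(x):
    # Hypothetical regression function; the third covariate is irrelevant.
    return 2.0 * x[0] + x[1]


def simulate(n, p, noise_sd, rng):
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [f_true(x) + rng.gauss(0.0, noise_sd) for x in X]
    return X, y


def fit_stump(X, y, features, thresholds=(0.25, 0.5, 0.75)):
    """Best single split (by squared error) among the candidate features."""
    best = None
    for j in features:
        for t in thresholds:
            left = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            if not left or not right:
                continue
            ml, mr = fmean(left), fmean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split found: predict the sample mean
        m = fmean(y)
        return (0, 0.0, m, m)
    return best[1:]


def fit_ensemble(X, y, n_trees, mtry, rng):
    n, p = len(X), len(X[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(p), mtry)          # mtry == p gives bagging
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return stumps


def predict(stumps, x):
    return fmean(ml if x[j] <= t else mr for j, t, ml, mr in stumps)


def decompose(mtry, n=50, p=3, noise_sd=0.5, n_trees=10, n_reps=30, seed=0):
    """Average squared bias and variance over a few fixed test points."""
    rng = random.Random(seed)
    test_points = [[rng.random() for _ in range(p)] for _ in range(5)]
    preds = [[] for _ in test_points]
    for _ in range(n_reps):  # a fresh training set for every replication
        X, y = simulate(n, p, noise_sd, rng)
        model = fit_ensemble(X, y, n_trees, mtry, rng)
        for k, x0 in enumerate(test_points):
            preds[k].append(predict(model, x0))
    bias2 = fmean((fmean(pk) - f_true(x0)) ** 2 for pk, x0 in zip(preds, test_points))
    variance = fmean(pvariance(pk) for pk in preds)
    est_mse = fmean(
        fmean((v - f_true(x0)) ** 2 for v in pk) for pk, x0 in zip(preds, test_points)
    )
    return {"bias2": bias2, "variance": variance, "estimation_mse": est_mse}


if __name__ == "__main__":
    print("bagging (mtry=3):      ", decompose(mtry=3))
    print("random splits (mtry=1):", decompose(mtry=1))
```

The reported `estimation_mse` excludes the irreducible error; adding `noise_sd ** 2` gives the out-of-sample MSE at the test points. With so few replications the numbers are noisy, so any comparison between the two settings should be read as illustrative only.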

Paper Structure

This paper contains 25 sections, 3 theorems, 27 equations, 8 figures, 13 tables.

Key Result

Proposition 1

For $X$, $\varepsilon$, and $Y$ satisfying the regression model and any estimator $\hat{f}$ that is independent of $X$ and $\varepsilon$, we have
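For orientation, the textbook form of a conditional bias-variance decomposition is sketched below. It is not taken from the paper and rests on assumptions that the listing does not confirm: the model is written as $Y = f(X) + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$ and $\operatorname{Var}(\varepsilon \mid X) = \sigma^2$, and the symbols $f$, $x$, and $\sigma^2$ are introduced here for illustration only. The paper's own statement may differ in notation or generality.

```latex
\mathbb{E}\!\left[\bigl(Y - \hat{f}(X)\bigr)^2 \,\middle|\, X = x\right]
  = \underbrace{\bigl(f(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\bigl(\hat{f}(x)\bigr)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The cross terms vanish because $\hat{f}$ is independent of $X$ and $\varepsilon$ and the noise has conditional mean zero.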

Figures (8)

  • Figure 1: Difference in conditional MSE for $\mathcal{U}$-MARS (top row) and $\mathcal{N}$-MARS (bottom).
  • Figure 2: Difference in conditional squared bias and variance for $\mathcal{N}$-MARS.
  • Figure 3: Effect of irrelevant covariates for $\mathcal{U}$-MARS.
  • Figure 4: Effect of correlated covariates for $\mathcal{N}$-MARS.
  • Figure 5: Effect of irrelevant covariates for $\mathcal{N}$-LINEAR.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Proposition 1: bias-variance decomposition
  • Proposition 2
  • Proposition 3