Random Forests as Statistical Procedures: Design, Variance, and Dependence

Nathaniel S. O'Connell

TL;DR

The paper develops a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation.

Abstract

Random forests are widely used prediction procedures, yet are typically described algorithmically rather than as statistical designs acting on a fixed set of covariates. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.
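
As a point of orientation, the separation described in the abstract has the same shape as a standard consequence of the law of total variance. The sketch below is not the paper's Theorem 1, whose statement is not reproduced on this page. It assumes that, given the outcomes $Y$ (a symbol introduced here for illustration), the trees $T_{\theta_1}(x),\dots,T_{\theta_B}(x)$ are independent and identically distributed, that $\hat{f}_B(x)$ is their average, and that $f_\infty(x)=\mathbb{E}[T_\theta(x)\mid Y]$:

$$
\operatorname{Var}\big(\hat{f}_B(x)\big)
= \underbrace{\operatorname{Var}\big(f_\infty(x)\big)}_{\text{persists as } B\to\infty}
+ \underbrace{\frac{1}{B}\,\mathbb{E}\big[\operatorname{Var}\big(T_\theta(x)\mid Y\big)\big]}_{\text{finite-aggregation term}}
$$

Under the same assumption, $\operatorname{Cov}\big(T_\theta(x),T_{\theta'}(x)\big)=\operatorname{Var}\big(f_\infty(x)\big)$ for independent draws $\theta,\theta'$, so the first term is an inter-tree covariance that adding trees does not reduce, while only the second term shrinks at rate $1/B$.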

Paper Structure (38 sections, 6 theorems, 73 equations, 2 figures)

Key Result

Theorem 1

For any $B \ge 1$,

Figures (2)

  • Figure 1: Random forests as randomized local averaging on fixed outcomes. Independent draws of the tree-generating design $\theta_1,\theta_2,\theta_3$ induce realized tree structures and tree-specific terminal-node membership sets $A_{\theta_b}(x)$ for prediction point $x$. Each tree prediction is the average over indexed outcomes in its membership set, and averaging over independent draws yields $\hat{f}_B(x)$ and $f_\infty(x)$.
  • Figure 2: Two distinct design-induced dependence mechanisms at a fixed prediction point $x$. (A) Observation overlap: independently generated trees reuse the same outcome (here, $Y_5$) in their terminal-node averages at $x$, inducing dependence through shared weighted outcomes. (B) Partition alignment without overlap: trees are grown on disjoint training sets ($\{1,2,3,4\}$ and $\{6,7,8,9\}$), yet the covariate geometry near $x$ drives both trees to discover the same splits ($X_1<c$, then $X_3<d$). The prediction point $x$ routes identically through both trees (dashed red paths), landing in structurally equivalent terminal regions. The resulting predictions $T_\theta(x)$ and $T_{\theta'}(x)$ average different observations drawn from the same neighborhood of $x$, producing dependence through aligned local averaging rules rather than shared outcomes.
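
The observation-overlap mechanism in Figure 2(A) can be illustrated with a small exact calculation. The following is a minimal sketch under simplifying assumptions that are not taken from the paper: each hypothetical "tree" averages a uniformly random membership set of `k` out of `n` indexed outcomes (no data-adaptive splitting), and the outcomes are independent with a common mean and variance `sigma2`. All function names are illustrative. Under these assumptions the variance of a `B`-tree average is `C + (V - C) / B`, where `V` is the single-tree variance and `C` is the covariance between two independently drawn trees.

```python
from fractions import Fraction
from itertools import combinations, product


def membership_weights(n, k):
    """Weight vectors for every size-k membership set drawn from n indexed outcomes.

    Each hypothetical tree averages the k outcomes in its membership set,
    so it puts weight 1/k on members and 0 elsewhere.
    """
    return [
        [Fraction(1, k) if i in members else Fraction(0) for i in range(n)]
        for members in map(set, combinations(range(n), k))
    ]


def tree_variance(weights, sigma2):
    """V = Var(T_theta): sigma2 * sum_i w_i^2, averaged over equally likely designs."""
    return sigma2 * sum(sum(w * w for w in ws) for ws in weights) / len(weights)


def covariance_floor(weights, sigma2):
    """C = Cov(T_theta, T_theta') for independent designs: sigma2 * sum_i (E[w_i])^2."""
    n = len(weights[0])
    mean_w = [sum(ws[i] for ws in weights) / len(weights) for i in range(n)]
    return sigma2 * sum(m * m for m in mean_w)


def forest_variance(weights, sigma2, B):
    """Closed form for the variance of a B-tree average: C + (V - C) / B."""
    V = tree_variance(weights, sigma2)
    C = covariance_floor(weights, sigma2)
    return C + (V - C) / B


def forest_variance_bruteforce(weights, sigma2, B):
    """Same quantity, obtained by enumerating every ordered B-tuple of designs."""
    n = len(weights[0])
    total = Fraction(0)
    count = 0
    for combo in product(weights, repeat=B):
        avg_w = [sum(ws[i] for ws in combo) / B for i in range(n)]
        total += sigma2 * sum(a * a for a in avg_w)
        count += 1
    return total / count


weights = membership_weights(n=5, k=2)
sigma2 = Fraction(1)
print("V =", tree_variance(weights, sigma2))
print("C =", covariance_floor(weights, sigma2))
for B in (1, 2, 10, 1000):
    print(B, forest_variance(weights, sigma2, B))
```

In this toy design the floor `C` equals `sigma2 / n`, the variance of the plain average of all `n` outcomes, so no number of trees pushes the forest variance below it. Exact rational arithmetic is used so that the closed form and the brute-force enumeration can be compared for equality rather than within a Monte Carlo tolerance.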

Theorems & Definitions (18)

  • Theorem 1: Finite-sample variance identity for random forests
  • proof
  • Theorem 2: Strict positivity of the covariance floor under observation reuse
  • Remark 1: Design interpretation
  • Proposition 1: Uniform outcome stability conditional on tree structure
  • proof
  • Remark 2: Contrast with coefficient-based predictors
  • Definition 1: Design-based resolution
  • Remark 3: Interpretation
  • Remark 4: Aggregation controls Monte Carlo variability
  • ...and 8 more