Random Forests as Statistical Procedures: Design, Variance, and Dependence

Nathaniel S. O'Connell

TL;DR

The paper develops a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This formulation yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation.

Abstract

Random forests are widely used prediction procedures, yet are typically described algorithmically rather than as statistical designs acting on a fixed set of covariates. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.
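The separation described in the abstract can be checked numerically in a toy setting. The following is a minimal sketch, not the paper's construction: it assumes the $B$ tree predictions at a fixed point are exchangeable with common variance `sigma2` and common pairwise covariance `cov` (both hypothetical names), and it simulates them as a shared Gaussian component plus independent noise. Under that assumption the variance of the average is `sigma2 / B + (1 - 1/B) * cov`, which tends to `cov` rather than zero as `B` grows.

```python
import numpy as np


def forest_variance(sigma2: float, cov: float, B: int) -> float:
    """Variance of the mean of B exchangeable predictions.

    Assumes every prediction has variance sigma2 and every distinct
    pair has covariance cov (an illustrative simplification).
    """
    return sigma2 / B + (1.0 - 1.0 / B) * cov


def simulate_forest_variance(sigma2, cov, B, n_rep=200_000, seed=0):
    """Monte Carlo estimate of the same quantity.

    Each replicate draws one shared component (variance cov) and B
    independent components (variance sigma2 - cov), so each simulated
    tree has variance sigma2 and each pair has covariance cov.
    """
    rng = np.random.default_rng(seed)
    shared = rng.normal(0.0, np.sqrt(cov), size=(n_rep, 1))
    own = rng.normal(0.0, np.sqrt(sigma2 - cov), size=(n_rep, B))
    trees = shared + own                 # shape (n_rep, B)
    return float(trees.mean(axis=1).var())


if __name__ == "__main__":
    for B in (1, 10, 100):
        exact = forest_variance(1.0, 0.3, B)
        approx = simulate_forest_variance(1.0, 0.3, B, n_rep=50_000)
        print(B, round(exact, 4), round(approx, 4))
```

In this toy model, increasing `B` shrinks only the `sigma2 / B` part; the covariance term remains as a floor, mirroring the abstract's claim that adding trees alone cannot remove predictive variability.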

Paper Structure (38 sections, 6 theorems, 73 equations, 2 figures)

Introduction
Existing theory and its limitations
A design-based perspective
Random Forests as Statistical Procedures
Variance decomposition and scope
Tree-level randomized regression functions
The forest predictor and its design-based target
Variance of Random Forest Predictors
Finite-sample variance identity
Decomposing the single-tree variance
Decomposing the within-resample variance term
Decomposing the resampling component
Decomposing the Covariance Structure
Law of Total Covariance Decomposition
Covariance Induced by Shared Training Observations
...and 23 more sections

Key Result

Theorem 1

For any $B \ge 1$,
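The displayed statement of Theorem 1 is cut off in this extract and is left as it stands. For orientation only, and as an assumption rather than the paper's wording: if $\hat{f}_B(x) = B^{-1}\sum_{b=1}^{B} T_{\theta_b}(x)$ averages $B$ identically distributed tree predictions with a common pairwise covariance, elementary variance algebra gives the following identity, where $\theta$ and $\theta'$ denote two distinct draws of the tree-generating design.

```latex
\operatorname{Var}\bigl(\hat{f}_B(x)\bigr)
  = \frac{1}{B}\,\operatorname{Var}\bigl(T_\theta(x)\bigr)
  + \Bigl(1 - \frac{1}{B}\Bigr)\,
    \operatorname{Cov}\bigl(T_\theta(x),\, T_{\theta'}(x)\bigr)
```

The first term vanishes as $B \to \infty$ while the second does not, which is consistent with the abstract's description of a dependence term that persists under infinite aggregation. The paper's exact statement and conditioning may differ from this generic form.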

Figures (2)

Figure 1: Random forests as randomized local averaging on fixed outcomes. Independent draws of the tree-generating design $\theta_1,\theta_2,\theta_3$ induce realized tree structures and tree-specific terminal-node membership sets $A_{\theta_b}(x)$ for prediction point $x$. Each tree prediction is the average over indexed outcomes in its membership set, and averaging over independent draws yields $\hat{f}_B(x)$ and $f_\infty(x)$.
Figure 2: Two distinct design-induced dependence mechanisms at a fixed prediction point $x$. (A) Observation overlap: independently generated trees reuse the same outcome (here, $Y_5$) in their terminal-node averages at $x$, inducing dependence through shared weighted outcomes. (B) Partition alignment without overlap: trees are grown on disjoint training sets ($\{1,2,3,4\}$ and $\{6,7,8,9\}$), yet the covariate geometry near $x$ drives both trees to discover the same splits ($X_1<c$, then $X_3<d$). The prediction point $x$ routes identically through both trees (dashed red paths), landing in structurally equivalent terminal regions. The resulting predictions $T_\theta(x)$ and $T_{\theta'}(x)$ average different observations drawn from the same neighborhood of $x$, producing dependence through aligned local averaging rules rather than shared outcomes.
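The local-averaging view in the two captions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the outcome vector `y`, the membership sets in `A`, and the helper names are all hypothetical, and indices are 0-based (so index 4 plays the role of $Y_5$ in Figure 2A).

```python
import numpy as np


def tree_prediction(y, members):
    """Average the outcomes indexed by one tree's membership set at x."""
    idx = sorted(members)
    return float(np.mean(y[idx]))


def forest_prediction(y, membership_sets):
    """Average the tree predictions over the B drawn designs."""
    return float(np.mean([tree_prediction(y, m) for m in membership_sets]))


def shared_outcomes(members_a, members_b):
    """Indices used by both trees at x (observation overlap)."""
    return set(members_a) & set(members_b)


# Fixed outcomes and three illustrative membership sets at one point x.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
A = [{0, 1, 4}, {4, 5}, {6, 7, 8}]

if __name__ == "__main__":
    print([tree_prediction(y, m) for m in A])
    print(forest_prediction(y, A))
    print(shared_outcomes(A[0], A[1]))  # overlap between the first two trees
    print(shared_outcomes(A[0], A[2]))  # disjoint sets: no shared outcome
```

Only the overlap mechanism of panel (A) is visible from membership sets alone. The alignment mechanism of panel (B) concerns trees whose sets are disjoint but lie in the same neighborhood of $x$, which this sketch does not attempt to model.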

Theorems & Definitions (18)

Theorem 1: Finite-sample variance identity for random forests
proof
Theorem 2: Strict positivity of the covariance floor under observation reuse
Remark 1: Design interpretation
Proposition 1: Uniform outcome stability conditional on tree structure
proof
Remark 2: Contrast with coefficient-based predictors
Definition 1: Design-based resolution
Remark 3: Interpretation
Remark 4: Aggregation controls Monte Carlo variability
...and 8 more
