
Multivariate Gaussian Approximation for Random Forest via Region-based Stabilization

Zhaoyang Shi, Chinmoy Bhattacharjee, Krishnakumar Balasubramanian, Wolfgang Polonik

TL;DR

The paper develops non-asymptotic, multivariate Gaussian approximation bounds for non-bagging, non-adaptive random forests built from k-Potential Nearest Neighbors (k-PNN) predictors under a Poisson sampling model. The central idea is region-based stabilization of score functions, which enables Malliavin-Stein-based Gaussian approximation with rates depending on the growth of k and the input dimension d; a detailed bound is given for the distance between the forest predictions and a multivariate normal, with explicit dependence on the stabilization geometry and moments. A key finding is the universality between k-PNN and k-NN forests: k-NN forests are a special case of k-PNN, but k-PNN exhibits rectangular, long-range dependence that necessitates region-based stabilization techniques and yields logarithmic-in-n rates. The paper also provides a general probabilistic result for Gaussian approximation of Poisson functionals with region-stabilizing scores, laying groundwork for broader applicability to related statistical problems and potential extensions to adaptive forests and other regression procedures. Overall, the results offer finite-sample guarantees for multivariate Gaussian behavior of random forest predictions under weak smoothness and moment conditions, highlighting the practical relevance for inference in high-dimensional nonparametric regression.

Abstract

We derive Gaussian approximation bounds for $k$-Potential Nearest Neighbor ($k$-PNN) based random forest predictions based on a set of training points given by a Poisson process under fairly mild regularity assumptions on the data generating process. Our approach is based on the key observation that $k$-PNN based random forest predictions satisfy a certain geometric property called region-based stabilization. We also compare the rates with those of $k$-nearest neighbor-based random forests, highlighting a form of universality in our result. In the process of developing our results, we also establish a probabilistic result on multivariate Gaussian approximation bounds for general functionals of Poisson process that are region-based stabilizing. This general result makes use of the Malliavin-Stein method, and is potentially applicable to various related statistical problems.
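The Poisson sampling model mentioned in the abstract can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the uniform intensity on the unit cube, the additive regression function, the noise level, and the helper names `sample_poisson_count` and `sample_poisson_training_set` are all hypothetical choices, not taken from the paper. It draws a Poisson number of training points with mean `n` and then places them independently.

```python
import random

def sample_poisson_count(mean, rng):
    """Draw N ~ Poisson(mean) by counting unit-rate exponential arrivals in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def sample_poisson_training_set(n, d, rng, noise_sd=0.1):
    """Poisson sample with intensity n * Uniform([0,1]^d) plus noisy responses.

    The regression function sum(x) and Gaussian noise are illustrative assumptions.
    """
    size = sample_poisson_count(n, rng)
    xs = [tuple(rng.random() for _ in range(d)) for _ in range(size)]
    ys = [sum(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

rng = random.Random(0)
xs, ys = sample_poisson_training_set(n=200, d=2, rng=rng)
print(len(xs))  # random sample size, close to n on average
```

In contrast to a fixed-size (binomial) sample, the number of training points here is itself random, which is the setting in which functionals of a Poisson process are studied.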

Paper Structure

This paper contains 23 sections, 21 theorems, 280 equations, 1 figure.

Key Result

Theorem 3.1

Assume there exist $p>0$ and $\sigma^2 > 0$ such that … For $m \in \mathbb{N}$ and $x_{0,i} \in \mathbb{R}^d$, $i = 1,\ldots,m$, let $\mathbf{r}_{n,k,w}$ be as in eq:rnkw, with covariance matrix $\Sigma_{m}$. Then, for $d, n\ge 2$ and $k=\mathcal{O}(n^{\alpha})$ with $0<\alpha<1$, there exists $c_{g}>0$ depending on $d$, $\sigma^{2}$, $g$, $\alpha$ and $p>0$ for $\mathsf{d}\in \{\mathsf{d}_{2}, \ldots\}$ …

Figures (1)

  • Figure 1: The set of 2-PNNs around a point $x_0 \in \mathbb{R}^2$. The point configuration includes all points in the figure except $x_0$. The blue and red points together are the 2-PNNs of $x_0$. Each red point, such as $\bm{x}_2$, has exactly one other point in its corresponding rectangle. Each blue point, such as $\bm{x}_1$, is also a 1-PNN, or LNN: the rectangle formed by $x_0$ and that point contains no other point.
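The rectangle criterion described in the Figure 1 caption can be sketched in a few lines. This is a minimal brute-force illustration, assuming that a point is a k-PNN of $x_0$ when fewer than k other configuration points fall in the axis-aligned rectangle with $x_0$ and that point as opposite corners; the closed-rectangle boundary convention and the function names `in_rectangle` and `k_pnn_indices` are assumptions made here, not the paper's notation.

```python
def in_rectangle(z, x0, x):
    """True if z lies in the closed axis-aligned rectangle with opposite corners x0 and x."""
    return all(min(a, b) <= c <= max(a, b) for a, b, c in zip(x0, x, z))

def k_pnn_indices(points, x0, k):
    """Indices of points having fewer than k other points in their rectangle with x0."""
    result = []
    for i, x in enumerate(points):
        # Count the other configuration points inside the rectangle spanned by x0 and x.
        others = sum(
            1 for j, z in enumerate(points) if j != i and in_rectangle(z, x0, x)
        )
        if others < k:
            result.append(i)
    return result

x0 = (0.0, 0.0)
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (-1.0, 2.5), (0.5, -4.0)]

one_pnn = k_pnn_indices(points, x0, k=1)  # empty rectangles only (the "blue" points)
two_pnn = k_pnn_indices(points, x0, k=2)  # also allows exactly one other point ("red")
print(one_pnn)  # [0, 3, 4]
print(two_pnn)  # [0, 1, 3, 4]
```

In this toy configuration, the point (2, 2) plays the role of a red point: its rectangle with $x_0$ contains exactly one other point, (1, 1), so it is a 2-PNN but not a 1-PNN. The far-away point (0.5, -4) is still a 1-PNN because its thin rectangle is empty, which hints at the long-range dependence mentioned in the TL;DR.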

Theorems & Definitions (57)

  • Definition 2.1: $k$-PNN
  • Theorem 3.1
  • Remark 3.1
  • Remark 3.2: De-localization of weights
  • Remark 3.3
  • Corollary 3.1
  • Remark 3.4
  • Remark 3.5: Moment condition
  • Remark 3.6: Binomial Point Processes
  • Remark 3.7: Comparison to MSE rates
  • ...and 47 more