Clustered random forests with correlated data for optimal estimation and inference under potential covariate shift

Elliot H. Young, Peter Bühlmann

TL;DR

The paper develops Clustered Random Forests (CRF) to handle clustered data with within-cluster dependence by embedding cluster-aware weights $W_i(\rho)$ into leaf-wise weighted least squares predictions. It proves minimax-type pointwise MSE rates and asymptotic Gaussianity for variance-dominating CRFs, and shows that optimal weights under covariate shift depend on the target covariate distribution $Q$ through the quantity $\mathcal{V}_Q(\rho)$. CRFs are shown to admit approximately linear-time fitting for weight structures like equicorrelated and AR(1), while providing valid inference via asymptotic normality and variance estimators. Numerical experiments, including a real-data CD4 cell count analysis in HIV patients, demonstrate substantial variance reduction and tighter confidence intervals under covariate shift, underscoring the practical impact of covariate-shift adaptive weighting in correlated data settings.
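
To illustrate why structured weights can keep fitting cheap, the following is a minimal Python sketch and not the paper's algorithm. It assumes, for illustration only, that $W_i(\rho)$ is the inverse of an equicorrelated or AR(1) working correlation matrix; the function names `equicorr_quad` and `ar1_quad` are hypothetical. Both evaluate the quadratic form $x^\top W_i(\rho)\, y$ in time linear in the cluster size, using the closed-form inverses of these two matrices rather than a dense inversion.

```python
import numpy as np

def equicorr_quad(x, y, rho):
    """x^T W y with W = inverse of (1 - rho) * I + rho * 1 1^T, in O(n).

    Uses the Sherman-Morrison form
    W = (I - c * 1 1^T) / (1 - rho), with c = rho / (1 + (n - 1) * rho).
    Assumes 0 <= rho < 1 (illustrative restriction).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    c = rho / (1.0 + (n - 1) * rho)
    return (x @ y - c * x.sum() * y.sum()) / (1.0 - rho)

def ar1_quad(x, y, rho):
    """x^T W y with W = inverse of the AR(1) matrix rho^{|j-k|}, in O(n).

    The inverse is tridiagonal: diagonal (1, 1 + rho^2, ..., 1 + rho^2, 1)
    and off-diagonal -rho, all divided by (1 - rho^2).
    Assumes |rho| < 1.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    if n == 1:
        # A single observation has a 1 x 1 identity correlation matrix.
        return float(x[0] * y[0])
    diag = np.full(n, 1.0 + rho**2)
    diag[0] = diag[-1] = 1.0
    main = np.sum(diag * x * y)
    off = -rho * (x[:-1] @ y[1:] + x[1:] @ y[:-1])
    return (main + off) / (1.0 - rho**2)
```

Because each call touches every observation in a cluster a constant number of times, summing such quantities over clusters costs time proportional to the total number of observations.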

Abstract

We develop Clustered Random Forests, a random forests algorithm for clustered data, arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions for each decision tree making up clustered random forests take the form of a weighted least squares estimator, which leverages correlations between observations for improved prediction accuracy and tighter confidence intervals when performing inference. We show that approximately linear time algorithms exist for fitting classes of clustered random forests, matching the computational complexity of standard random forests. Further, we observe that the optimality of a clustered random forest, with regard to how optimal weights are chosen within this framework, i.e. those that minimise mean squared prediction error, varies under covariate distribution shift. In light of this, we advocate weight estimation to be determined by a user-chosen covariate distribution, or test dataset of covariates, with respect to which optimal prediction or inference is desired. This highlights a key distinction between correlated and independent data with regard to optimality of nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.
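
The leaf-wise weighted least squares idea can be sketched for a single, already-grown tree as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree partition is taken as given through `leaf_ids`, the weights are assumed to be inverses of an equicorrelated working correlation matrix with a fixed parameter `rho`, every leaf is assumed non-empty, and the names `equicorrelated_precision` and `leafwise_wls` are hypothetical.

```python
import numpy as np

def equicorrelated_precision(n, rho):
    """Inverse of (1 - rho) * I + rho * 1 1^T (assumed working weights)."""
    c = rho / (1.0 + (n - 1) * rho)
    return (np.eye(n) - c * np.ones((n, n))) / (1.0 - rho)

def leafwise_wls(leaf_ids, y, cluster_ids, n_leaves, rho):
    """Weighted least squares of y on leaf indicators, with cluster weights.

    Solves (sum_i X_i^T W_i X_i) beta = sum_i X_i^T W_i y_i, where X_i is the
    one-hot leaf-membership matrix of cluster i. Returns one fitted value per
    leaf. Assumes 0 <= rho < 1 and that every leaf contains an observation.
    """
    leaf_ids = np.asarray(leaf_ids)
    cluster_ids = np.asarray(cluster_ids)
    y = np.asarray(y, dtype=float)
    A = np.zeros((n_leaves, n_leaves))
    b = np.zeros(n_leaves)
    for c in np.unique(cluster_ids):
        idx = cluster_ids == c
        X = np.eye(n_leaves)[leaf_ids[idx]]          # n_i x n_leaves one-hot
        W = equicorrelated_precision(int(idx.sum()), rho)
        A += X.T @ W @ X
        b += X.T @ W @ y[idx]
    return np.linalg.solve(A, b)
```

With `rho = 0` the weights reduce to the identity and the fitted values are the ordinary leaf means, which is the standard random forest leaf prediction; a nonzero `rho` lets observations in the same cluster but different leaves inform each other's fitted values.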

Paper Structure

This paper contains 31 sections, 15 theorems, 213 equations, 6 figures, 4 tables, 3 algorithms.

Key Result

Theorem 1

Let $\hat{\mu}_I^{\mathrm{MC}}$ be a clustered random forest estimator (eq:MC) satisfying ass:tree and ass:weights, trained on $I$ clusters drawn from a law $P$ satisfying ass:data, with $k_I^{-1}I^{-1}s_I\to0$ and $k_I^{-1}n_{\mathrm{c}}s_I\to\infty$ as $I\to\infty$. Also fix an arbitrarily small constant [...] where $C_{\mathrm{bias}}:=\frac{6C_WL_\mu d^{1/2}\nu}{c_W}$, and [...] is the $Q$-integrated variance of [...]

Figures (6)

  • Figure 1: Bias--variance tradeoff for random forests, subsampling $s_I=I^\beta$ clusters per tree. Clustered random forests provide a reduction in variance, and consequently a reduction in mean squared error. See Appendix \ref{appsec:introsim} for further details.
  • Figure 2: Pointwise MSE as a function of the minimum node size $k_I$ and subsampling fraction $I^{-1}s_I$ for the simulation of Figure \ref{fig:sim1} with $I=10^4$. A valley runs along $k_I^{-1}s_I\approx30$. See Appendix \ref{appsec:introsim} for further details.
  • Figure 3: The training (blue) and testing (red) mean squared prediction errors (MSPE) for clustered random forests, weighted by an equicorrelated structure with fixed equicorrelation parameter $\rho\in[0,0.9]$ for the setting of Simulation \ref{sec:sim2}. The optimal clustered random forest with respect to training MSPE (corresponding to $\rho\approx 0.55$) can be seen to provide worse testing MSPE than every clustered random forest with $\rho\in[0,0.55)$, including the standard honest random forest (obtained by taking $\rho=0$).
  • Figure 4: MSPE of standard and clustered random forests (Algorithm \ref{alg:crf}) for a varying number of trees $B\in[1,1000]$ making up the random forests.
  • Figure 5: Gaussian QQ plots for predictions $\hat{\mu}({\bf1})$ in the simulations of Section \ref{sec:sim3}, for each covariate dimension $d\in\{1,3,5,10,50\}$.
  • ...and 1 more figure

Theorems & Definitions (32)

  • Theorem 1
  • Theorem 2: Covariate shift
  • Proposition 3
  • Remark 4: Approaching minimax rates
  • Theorem 5: Asymptotic Normality of variance-dominating CRFs
  • Proposition 6: Approximate linear time clustered random forest fitting
  • Lemma 7: Asymptotic unbiasedness of clustered random forests
  • proof: Proof of Lemma \ref{lem:unbiasedness}
  • Lemma 8
  • proof
  • ...and 22 more