Clustered random forests with correlated data for optimal estimation and inference under potential covariate shift
Elliot H. Young, Peter Bühlmann
TL;DR
The paper develops Clustered Random Forests (CRF) to handle clustered data with within-cluster dependence by embedding cluster-aware weights $W_i(\rho)$ into leaf-wise weighted least squares predictions. It proves minimax-type pointwise MSE rates and asymptotic Gaussianity for variance-dominating CRFs, and shows that optimal weights under covariate shift depend on the target covariate distribution $Q$ through the quantity $\mathcal{V}_Q(\rho)$. CRFs are shown to admit approximately linear-time fitting for weight structures like equicorrelated and AR(1), while providing valid inference via asymptotic normality and variance estimators. Numerical experiments, including a real-data CD4 cell count analysis in HIV patients, demonstrate substantial variance reduction and tighter confidence intervals under covariate shift, underscoring the practical impact of covariate-shift adaptive weighting in correlated data settings.
Abstract
We develop Clustered Random Forests, a random forests algorithm for clustered data arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions of each decision tree making up a clustered random forest take the form of a weighted least squares estimator, which leverages correlations between observations for improved prediction accuracy and tighter confidence intervals when performing inference. We show that approximately linear time algorithms exist for fitting classes of clustered random forests, matching the computational complexity of standard random forests. Further, we observe that the optimal weights within this framework, i.e. those that minimise mean squared prediction error, vary under covariate distribution shift. In light of this, we advocate that weight estimation be determined by a user-chosen covariate distribution, or test dataset of covariates, with respect to which optimal prediction or inference is desired. This highlights a key distinction between correlated and independent data with regard to optimality of nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.
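To make the leaf-wise weighted least squares idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes an intercept-only fit within a single leaf and an equicorrelated working correlation $(1-\rho)I + \rho \mathbf{1}\mathbf{1}^\top$ for each cluster's observations in that leaf, with the weight taken as its inverse; the exact parametrisation of $W_i(\rho)$ used in the paper is not stated in this section. The function name `leaf_gls_mean` and its arguments are hypothetical. Under these assumptions the weighted estimate needs only per-cluster sums and counts, so it is computed in time linear in the number of observations, without forming any matrix.

```python
import numpy as np


def leaf_gls_mean(y, cluster_ids, rho):
    """Weighted least squares estimate of a constant leaf mean.

    Assumes (for illustration only) an equicorrelated working correlation
    (1 - rho) * I + rho * 1 1^T within each cluster, with weight matrix
    equal to its inverse. Requires 1 + (n_i - 1) * rho > 0 for every
    cluster size n_i in the leaf.
    """
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)

    # Map each observation to a cluster index and count cluster sizes.
    _, inverse, counts = np.unique(
        cluster_ids, return_inverse=True, return_counts=True
    )
    # Per-cluster sums of the responses.
    sums = np.bincount(inverse, weights=y)

    # For the equicorrelated matrix, W_i 1 = 1 / (1 + (n_i - 1) * rho),
    # so 1^T W_i y_i and 1^T W_i 1 reduce to scaled sums and counts.
    scale = 1.0 / (1.0 + (counts - 1) * rho)
    return float(np.sum(scale * sums) / np.sum(scale * counts))


# Three observations from cluster 0 and one from cluster 1 in the same leaf.
y = [1.0, 2.0, 3.0, 10.0]
clusters = [0, 0, 0, 1]
unweighted = leaf_gls_mean(y, clusters, rho=0.0)  # plain leaf average
weighted = leaf_gls_mean(y, clusters, rho=0.5)    # down-weights the large cluster
```

With `rho=0.0` the estimate reduces to the ordinary leaf average used by a standard regression tree; a positive `rho` gives each observation in a larger cluster less weight, reflecting that correlated observations carry partly redundant information.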
