Asymptotic Theory and Phase Transitions for Variable Importance in Quantile Regression Forests
Tomoshige Nakamura, Hiroshi Shiraishi
TL;DR
Addresses statistical inference for variable importance (VI) in Quantile Regression Forests (QRF), which is complicated by the non-smooth pinball loss and high-dimensional bias. The paper develops an asymptotic theory for QRF estimators using Knight's identity and reveals a phase transition in variable-importance inference governed by the subsampling rate β. In the bias-dominated regime (β ≥ 1/2), the VI estimator converges to a deterministic bias; the explicit bias term is derived, and a bias correction is proposed to restore asymptotic normality. The work highlights a fundamental trade-off between prediction and inference in high-dimensional random-forest settings and provides a pathway toward valid inference via analytic bias correction.
Abstract
Quantile Regression Forests (QRF) are widely used for non-parametric conditional quantile estimation, yet statistical inference for variable importance measures remains challenging due to the non-smoothness of the loss function and the complex bias-variance trade-off. In this paper, we develop an asymptotic theory for variable importance defined as the difference in pinball loss risks. We first establish the asymptotic normality of the QRF estimator by handling the non-differentiable pinball loss via Knight's identity. Second, we uncover a "phase transition" phenomenon governed by the subsampling rate $β$ (where $s \asymp n^β$). We prove that in the bias-dominated regime ($β\ge 1/2$), which corresponds to large subsample sizes typically favored in practice to maximize predictive accuracy, standard inference breaks down as the estimator converges to a deterministic bias constant rather than a zero-mean normal distribution. Finally, we derive the explicit analytic form of this asymptotic bias and discuss the theoretical feasibility of restoring valid inference via analytic bias correction. Our results highlight a fundamental trade-off between predictive performance and inferential validity, providing a theoretical foundation for understanding the intrinsic limitations of random forest inference in high-dimensional settings.
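For readers unfamiliar with Knight's identity, the following is a minimal numerical sketch, not taken from the paper. It assumes the standard textbook form of the identity for the pinball loss $\rho_\tau(u) = u(\tau - 1\{u<0\})$, namely $\rho_\tau(u-v) - \rho_\tau(u) = -v(\tau - 1\{u<0\}) + \int_0^v (1\{u \le s\} - 1\{u \le 0\})\,ds$, which splits the non-differentiable loss difference into a linear term and a non-negative remainder. The function names `pinball` and `knight_rhs`, the grid size, and the sampling ranges are illustrative choices; the paper's own notation and use of the identity may differ.

```python
import numpy as np

def pinball(u, tau):
    """Pinball (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - float(u < 0))

def knight_rhs(u, v, tau, n_grid=100_000):
    """Right-hand side of Knight's identity.

    The integral over [0, v] is approximated with a midpoint rule;
    a signed step makes the same formula valid for v < 0.
    """
    linear = -v * (tau - float(u < 0))
    step = v / n_grid
    s = (np.arange(n_grid) + 0.5) * step          # midpoints between 0 and v
    integrand = (u <= s).astype(float) - float(u <= 0)
    return linear + integrand.sum() * step

# Compare both sides of the identity on random (u, v, tau) triples.
rng = np.random.default_rng(0)
max_gap = 0.0
for _ in range(100):
    u, v = rng.uniform(-3.0, 3.0, size=2)
    tau = rng.uniform(0.05, 0.95)
    lhs = pinball(u - v, tau) - pinball(u, tau)
    max_gap = max(max_gap, abs(lhs - knight_rhs(u, v, tau)))

print(f"largest |LHS - RHS| over 100 draws: {max_gap:.2e}")
```

The gap that remains is quadrature error from the single jump in the integrand, so it shrinks as `n_grid` grows.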
