Asymptotic Theory and Phase Transitions for Variable Importance in Quantile Regression Forests

Tomoshige Nakamura, Hiroshi Shiraishi

TL;DR

Addresses statistical inference for variable importance (VI) in Quantile Regression Forests (QRF), which is complicated by the non-smooth pinball loss and by high-dimensional bias. Develops an asymptotic theory for QRF estimators using Knight's identity and reveals a phase transition in VI inference governed by the subsampling rate β. In the bias-dominated regime (β≥1/2), the VI estimator converges to a deterministic bias constant rather than a zero-mean normal limit; the explicit bias term is derived, and analytic bias correction is discussed as a route to restoring valid inference. The work highlights a fundamental trade-off between predictive performance and inferential validity in high-dimensional random-forest settings.

Abstract

Quantile Regression Forests (QRF) are widely used for non-parametric conditional quantile estimation, yet statistical inference for variable importance measures remains challenging due to the non-smoothness of the loss function and the complex bias-variance trade-off. In this paper, we develop an asymptotic theory for variable importance defined as the difference in pinball loss risks. We first establish the asymptotic normality of the QRF estimator by handling the non-differentiable pinball loss via Knight's identity. Second, we uncover a "phase transition" phenomenon governed by the subsampling rate $\beta$ (where $s \asymp n^{\beta}$). We prove that in the bias-dominated regime ($\beta \ge 1/2$), which corresponds to large subsample sizes typically favored in practice to maximize predictive accuracy, standard inference breaks down as the estimator converges to a deterministic bias constant rather than a zero-mean normal distribution. Finally, we derive the explicit analytic form of this asymptotic bias and discuss the theoretical feasibility of restoring valid inference via analytic bias correction. Our results highlight a fundamental trade-off between predictive performance and inferential validity, providing a theoretical foundation for understanding the intrinsic limitations of random forest inference in high-dimensional settings.
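
To make "variable importance defined as the difference in pinball loss risks" concrete, the following is a minimal sketch, not the paper's estimator. It assumes the importance of a covariate is the risk of a quantile predictor that ignores that covariate minus the risk of the full predictor, and it replaces the forest with oracle quantile functions in a toy model $Y = X_1 + \varepsilon$ with standard normal $X_1$ and $\varepsilon$, where a second covariate $X_2$ is irrelevant. The function names and the simulation design are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist, fmean

def pinball_loss(u, tau):
    """Pinball (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def empirical_risk(y, q, tau):
    """Average pinball loss of quantile predictions q against outcomes y."""
    return fmean(pinball_loss(yi - qi, tau) for yi, qi in zip(y, q))

def variable_importance(y, q_full, q_reduced, tau):
    """Assumed VI: risk of the reduced predictor minus risk of the full one."""
    return empirical_risk(y, q_reduced, tau) - empirical_risk(y, q_full, tau)

random.seed(0)
tau, n = 0.9, 20000
z_tau = NormalDist().inv_cdf(tau)  # standard normal tau-quantile

# Toy model (an assumption for illustration): Y = X1 + eps, X2 plays no role.
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
eps = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [a + e for a, e in zip(x1, eps)]

# Oracle conditional quantiles stand in for a fitted forest.
q_full = [a + z_tau for a in x1]       # uses X1 (and X2, which is irrelevant)
q_without_x1 = [2 ** 0.5 * z_tau] * n  # Y ~ N(0, 2) once X1 is dropped
q_without_x2 = q_full                  # dropping X2 changes nothing

vi_x1 = variable_importance(y, q_full, q_without_x1, tau)
vi_x2 = variable_importance(y, q_full, q_without_x2, tau)
print(f"VI(X1) = {vi_x1:.4f}, VI(X2) = {vi_x2:.4f}")
```

In this sketch the relevant covariate receives a strictly positive importance and the irrelevant one receives exactly zero, because its reduced predictor coincides with the full one. With an estimated forest in place of the oracle quantiles, estimation bias enters both risks, which is the setting the abstract's phase-transition result concerns.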

Paper Structure

This paper contains 61 sections, 15 theorems, 144 equations.

Key Result

Proposition 1

Theorems & Definitions (32)

  • Definition 1: Variable Importance
  • Proposition 1: Properties of Variable Importance
  • Lemma 1: Knight's Identity
  • Lemma 2: Gateaux Derivative of Risk
  • Proposition 2: Neyman Orthogonality
  • Remark 1
  • Theorem 1: Consistency and Rate of Convergence
  • proof
  • Theorem 2: Asymptotic Normality
  • proof
  • ...and 22 more
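
Lemma 1 names Knight's identity, which the abstract uses to handle the non-differentiable pinball loss. The page does not reproduce the paper's statement, so the sketch below checks the standard textbook form numerically: $\rho_\tau(u - v) - \rho_\tau(u) = -v\,\psi_\tau(u) + \int_0^v (1\{u \le s\} - 1\{u \le 0\})\,ds$. As an assumption made here, $\psi_\tau(u) = \tau - 1\{u \le 0\}$ is used so that the identity also holds exactly at $u = 0$; this differs from the common convention $\tau - 1\{u < 0\}$ only at that single point. The helper names are hypothetical.

```python
import random

def pinball_loss(u, tau):
    """Pinball (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def knight_rhs(u, v, tau):
    """Right-hand side of Knight's identity with the integral in closed form."""
    psi = tau - (1.0 if u <= 0 else 0.0)  # score term, convention assumed here
    if v >= 0:
        # For v >= 0 the integrand is nonzero only when 0 < u <= s.
        integral = max(v - u, 0.0) if u > 0 else 0.0
    else:
        # For v < 0 the integrand is nonzero only when s < u <= 0.
        integral = max(u - v, 0.0) if u <= 0 else 0.0
    return -v * psi + integral

def knight_gap(u, v, tau):
    """Absolute difference between the two sides of the identity."""
    lhs = pinball_loss(u - v, tau) - pinball_loss(u, tau)
    return abs(lhs - knight_rhs(u, v, tau))

random.seed(1)
max_gap = max(
    knight_gap(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0), random.random())
    for _ in range(10000)
)
print(f"largest discrepancy over random draws: {max_gap:.2e}")
```

The identity splits the loss increment into a term linear in $v$ and a non-negative remainder, which is what allows expansions of the pinball risk without differentiating the loss itself.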