Global Censored Quantile Random Forest

Siyu Zhou, Limin Peng

Abstract

In recent years, censored quantile regression has enjoyed increasing popularity in survival analysis, yet many existing works rely on linearity assumptions. In this work, we propose the Global Censored Quantile Random Forest (GCQRF), a flexible and competitive forest-based method that predicts a conditional quantile process from data subject to right censoring and can capture complex nonlinear relationships. Taking into account the randomness in trees and connecting the proposed method to a randomized incomplete infinite degree U-process (IDUP), we quantify the variation of the prediction process without assuming an infinite forest and establish its weak convergence. Moreover, we propose feature importance ranking measures based on out-of-sample predictive accuracy. We demonstrate the superior predictive accuracy of the proposed method over a number of existing alternatives and illustrate the use of the proposed importance ranking measures on both simulated and real data.

Paper Structure

This paper contains 15 sections, 4 theorems, 44 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Assume the index $\tau$ of interest lies in $[\tau_L, \tau_U]$ with bounded $\tau_L, \tau_U$. Denote $\theta_{s_n}(\tau) = \mathbb{P}(h_{s_n}(\omega, \tau))$ and $r_{1n}$ as defined in (eqn:ConvRate). Assume $\limsup \mathbb{P} H_{s_n}(\omega)^2 < \infty$ where $H_{s_n}(\omega)$ is an envelope of $h$ [...] a mean-zero Gaussian process uniformly continuous with respect to $\rho(\tau, \tau')$ with covariance [...]

Figures (6)

  • Figure 1: Comparison of predictive performance based on test relative I-QMSE in the settings of Nonlinear Homogeneous Low Dimension (upper left), Linear Homogeneous Low Dimension (upper right), Nonlinear Heterogeneous Low Dimension (lower left) and Nonlinear Heterogeneous High Dimension (lower right). Results for the remaining settings can be found in the supplement.
  • Figure 2: Comparison of predictive performance based on test relative I-QLoss in the settings of Nonlinear Homogeneous Low Dimension (upper left), Linear Homogeneous Low Dimension (upper right), Nonlinear Heterogeneous Low Dimension (lower left) and Nonlinear Heterogeneous High Dimension (lower right). Results for the remaining settings can be found in the supplement.
  • Figure 3: True conditional feature importance ranking based on increase in I-QMSE with (a) $\tau \in [0.4, 0.6]$ and (b) $\tau \in [0.1, 0.3]$. In both (a) and (b), the left and right columns correspond to the independent case ($\rho = 0$) and correlated case ($\rho = 0.9$) respectively with data generated from the model with homogeneous error (top row) or heterogeneous error (bottom row).
  • Figure 4: Estimated conditional feature importance ranking averaged over 100 simulations with training size $n=100$ (top row) or $n=1000$ (bottom row) based on increase in I-QMSE with $\tau \in [0.4, 0.6]$ (left column) or $\tau \in [0.1, 0.3]$ (right column). Within each subfigure, the left and right columns correspond to the independent case ($\rho = 0$) and correlated case ($\rho = 0.9$) respectively with data generated from the model with homogeneous error (top row) or heterogeneous error (bottom row). In each setting, the three colors correspond to different ways of muting the effect of each feature of interest: dropping (green), permuting (green) and replacing with knockoff counterparts (blue). The error bar represents 1 standard error over 100 simulations.
  • Figure 5: Comparison of predictive accuracy on the Boston Housing data
  • ...and 1 more figure

Theorems & Definitions (6)

  • Definition 1: Covering Number
  • Definition 2: Euclidean Function Class
  • Theorem 1: Weak Convergence of Randomized Complete IDUPs
  • Theorem 2: Weak Convergence of Randomized Incomplete IDUPs
  • Corollary 1
  • Proposition 1