Transfer Learning with Distance Covariance for Random Forest: Error Bounds and an EHR Application

Chenze Li, Subhadeep Paul

TL;DR

This work develops a transfer-learning framework for nonparametric regression with random forests by combining a source-domain centered random forest (CRF) with a residual-target calibration driven by distance covariance. The core idea is to predict the target response using the source model, model the residuals in the target with a second CRF whose feature-splitting probabilities are weighted by estimated distance covariances with the residuals, and then sum the two predictions. The paper establishes nonasymptotic mean-squared-error bounds showing that transfer learning yields faster convergence rates when the difference between source and target functions depends on a sparse subset of features; it also shows how the method extends to standard RF via DCOV weighting. Empirically, the authors validate the approach through simulations and a large-scale eICU dataset, demonstrating meaningful gains in predicting ICU mortality for smaller hospitals by leveraging data from larger hospitals. Overall, the study provides theoretical guarantees and practical algorithms that extend transfer learning with statistical error control to random forest methods in nonparametric settings.

Abstract

Random forest is an important method for ML applications due to its broad outperformance over competing methods for structured tabular data. We propose a method for transfer learning in nonparametric regression using a centered random forest (CRF) with distance covariance-based feature weights, assuming the unknown source and target regression functions are different for a few features (sparsely different). Our method first obtains residuals from predicting the response in the target domain using a source domain-trained CRF. Then, we fit another CRF to the residuals, but with feature splitting probabilities proportional to the sample distance covariance between the features and the residuals in an independent sample. We derive an upper bound on the mean square error rate of the procedure as a function of sample sizes and difference dimension, theoretically demonstrating transfer learning benefits in random forests. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard random forest (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights on the performance of RF in some situations. Our method shows significant gains in predicting the mortality of ICU patients in smaller-bed target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients.
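The weighting step described above can be sketched in a few lines. The following is a minimal, standard-library-only illustration, not the authors' implementation: it computes the sample distance covariance (in the usual double-centered V-statistic form) between each feature and the target-domain residuals, then normalizes the values into feature-splitting probabilities. The function names, the use of the unsquared distance covariance, and the uniform fallback when every value is zero are assumptions made for illustration.

```python
import math
import random


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def sample_dcov_sq(x, y):
    """Squared sample distance covariance of two scalar samples (V-statistic)."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def split_probabilities(feature_columns, residuals):
    """Probabilities proportional to the sample distance covariance between
    each feature column and the residuals (hypothetical helper)."""
    dcov = [
        math.sqrt(max(sample_dcov_sq(col, residuals), 0.0))
        for col in feature_columns
    ]
    total = sum(dcov)
    d = len(feature_columns)
    if total == 0.0:
        # Assumption: fall back to uniform splitting when nothing is dependent.
        return [1.0 / d] * d
    return [v / total for v in dcov]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 60
    x0 = [rng.uniform(0, 1) for _ in range(n)]
    x1 = [rng.uniform(0, 1) for _ in range(n)]
    # Hypothetical residuals that depend only on the first feature.
    residuals = [2.0 * v for v in x0]
    print(split_probabilities([x0, x1], residuals))
```

In the procedure described in the abstract, the residuals would come from subtracting the source-trained CRF's predictions from the target responses, and the resulting probabilities would steer feature selection in the second CRF; the final prediction is the sum of the two forests' outputs.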

Paper Structure

This paper contains 24 sections, 10 theorems, 101 equations, 13 figures, and 4 algorithms.

Key Result

Proposition 1

Assume $p_j^{(s)} = \frac{1}{d}$, $j = 1, \ldots, d$. Under Assumption assmp1 and Assumption assmp2 and conditional on $(p_j^{(s)})_{1 \le j \le d}$, ... Further, if we define $z_s := \frac{2 \log_2(1 - 1/(2d))}{2 \log_2(1 - 1/(2d)) - 1}$, and choose $k_{n_s} = c \bigl( n_s \bigl( \log_2^{\,d - 1} n_s \bigr)^{1/2} \bigr)^{1 - z_s}$, with some constant $c > 0$ independent of $n_s$, then, given $(p_j^{(s)})_{1 \le j \le d}$, ...
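To get a feel for the quantities in this proposition, the short sketch below evaluates $z_s$ and $k_{n_s}$ numerically. It assumes that $1/2d$ means $1/(2d)$ and that $\log_2^{\,d-1} n_s$ means $(\log_2 n_s)^{d-1}$; the function names and the default $c = 1$ are hypothetical choices for illustration, since the statement only requires some constant $c > 0$.

```python
import math


def z_s(d):
    """Exponent z_s from the proposition, reading 1/2d as 1/(2d)."""
    a = 2.0 * math.log2(1.0 - 1.0 / (2.0 * d))
    return a / (a - 1.0)


def k_ns(n_s, d, c=1.0):
    """Choice of k_{n_s} from the proposition; c = 1 is an arbitrary default."""
    log_term = math.log2(n_s) ** (d - 1)  # (log_2 n_s)^(d - 1)
    return c * (n_s * math.sqrt(log_term)) ** (1.0 - z_s(d))


if __name__ == "__main__":
    for d in (1, 2, 10, 50):
        print(d, z_s(d), k_ns(20000, d))
```

Under this reading, $z_s$ equals $2/3$ for $d = 1$ and shrinks toward zero as $d$ grows, so the exponent $1 - z_s$ applied to the sample-size term approaches one in high dimensions.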

Figures (13)

  • Figure 1: Distance covariance aided standard random forest for sparse functions
  • Figure 2: Benefits of distance covariance weighting when some features dominate
  • Figure 3: Performance of the transfer learning algorithm and centered random forest on the test data set, with $n_s = 20000, n_t = 500, n_{test} = 100, d = 50$
  • Figure 4: Performance of the transfer learning algorithm and centered random forest on the test data set, with $n_s = 10000, n_{test} = 100, d = 50, r = 0.1$
  • Figure 5: $n_s = 10000, n_t = 500, n_{test} = 100, d = 50, r = 0.1$
  • ...and 8 more figures

Theorems & Definitions (19)

  • Definition 1
  • Proposition 1: Centered random forests, klusowski2021sharp
  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Corollary 1
  • Proposition 2
  • ...and 9 more