Lassoed Forests: Random Forests with Adaptive Lasso Post-selection

Jing Shang, James Bannon, Benjamin Haibe-Kains, Robert Tibshirani

TL;DR

This work addresses bias and variance in tree ensembles by proposing Lassoed Forest, an adaptive framework that blends vanilla random forests with post‑selection forests through data‑driven weights learned via cross‑fitting. The authors establish theory showing how the relative performance of vanilla and post‑selection methods depends on the signal‑to‑noise ratio, and prove that the adaptive objective with offset strictly improves upon both benchmarks under certain conditions. They validate the approach with extensive simulations and real‑world biomedical datasets, demonstrating robust predictive gains and effective variable selection across regression, survival, and classification tasks. The method offers a practical, nonparametric tool that adapts to data characteristics while retaining interpretability and applicability to diverse domains.

Abstract

Random forests are a statistical learning technique that uses bootstrap aggregation to average high-variance, low-bias trees. Improvements to random forests, such as applying Lasso regression to the tree predictions, have been proposed in order to reduce model bias. However, these changes can sometimes degrade performance (e.g., increase the mean squared error). In this paper, we show in theory that the relative performance of these two methods, standard and Lasso-weighted random forests, depends on the signal-to-noise ratio. We further propose a unified framework that combines random forests and Lasso selection through adaptive weighting, and we show mathematically that it can strictly outperform the other two methods. We compare the three methods through simulation, including bias-variance decomposition, evaluation of error estimates, and variable importance analysis. We also show the versatility of our method by applying it to a variety of real-world datasets.
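
To make the two baselines in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes scikit-learn and a synthetic dataset (`make_friedman1`), and it uses a simple sample split so that the Lasso stage sees tree predictions on data the trees were not trained on; the paper's own cross-fitting scheme and its adaptive objective with offset are not reproduced here. The helper `tree_predictions` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def tree_predictions(forest, X):
    """Return an (n_samples, n_trees) matrix of per-tree predictions."""
    return np.column_stack([tree.predict(X) for tree in forest.estimators_])


# Synthetic regression data (an assumption; the paper uses its own simulations).
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Simple sample split: one half grows the trees, the other half fits the Lasso.
X_forest, X_lasso, y_forest, y_lasso = train_test_split(
    X_fit, y_fit, test_size=0.5, random_state=0
)

# Vanilla random forest: an equal-weight average of the trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_forest, y_forest)

# Post-selection forest: Lasso regression on the tree predictions, which
# reweights the trees and drops those whose coefficient shrinks to zero.
lasso = LassoCV(cv=5, max_iter=10000)
lasso.fit(tree_predictions(forest, X_lasso), y_lasso)

rf_pred = forest.predict(X_test)
ps_pred = lasso.predict(tree_predictions(forest, X_test))

rf_mse = mean_squared_error(y_test, rf_pred)
ps_mse = mean_squared_error(y_test, ps_pred)
n_selected = int(np.count_nonzero(lasso.coef_))

print(f"vanilla forest test MSE:        {rf_mse:.3f}")
print(f"post-selection forest test MSE: {ps_mse:.3f}")
print(f"trees kept by the Lasso:        {n_selected} of {len(forest.estimators_)}")
```

Which of the two test errors is smaller depends on the data; per the abstract, the signal-to-noise ratio drives that comparison, which can be explored here by varying the `noise` argument. Note that in this sketch the forest is grown on only half of the fitting data, so the comparison is illustrative rather than a faithful benchmark.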

Paper Structure

This paper contains 26 sections, 4 theorems, 46 equations, 13 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Under the equibias, homoscedasticity, and expressivity assumptions, we additionally assume that both the expectation and the covariance of the Lasso coefficients, conditional on the pre-trained base learners, are monotonically non-increasing in the signal-to-noise ratio (SNR). Then the mean squared error for both model predictions eq:est_mean and ...

Figures (13)

  • Figure 1: Test Error for California House Prices Prediction
  • Figure 2: Test Error for Spam Classification
  • Figure 3: Comparison of Random Forest and Post-Selection Forest
  • Figure 4: Lassoed Forest
  • Figure 5: Mean Squared Error for Polynomial Generating Functions
  • ...and 8 more figures

Theorems & Definitions (11)

  • Remark 1
  • Proposition 1
  • proof
  • Remark 2
  • Theorem 1
  • proof
  • Corollary 1.1
  • Remark 3
  • Theorem 2
  • proof
  • ...and 1 more