A Powerful Random Forest Featuring Linear Extensions (RaFFLE)

Jakob Raymaekers, Peter J. Rousseeuw, Thomas Servotte, Tim Verdonck, Ruicong Yao

TL;DR

RaFFLE addresses regression tasks where CART-based trees struggle to capture linear structure by introducing a random forest whose base learners are PILOT linear model trees. It combines bootstrap sampling and per-node feature sampling with a tunable regularization parameter in the node-model selection (via BIC_alpha) and removes a costly model type to increase diversity while maintaining speed. Theoretical results show universal consistency for additive models with finite total variation and a faster convergence rate on data generated by a linear model, along with explicit time/space complexity bounds. Empirically, RaFFLE outperforms CART, RF, XGBoost, Lasso, and Ridge across 136 datasets, with both tuned and default variants performing strongly, illustrating its versatility for linear and nonlinear regression tasks. The framework offers a scalable, flexible ensemble that effectively balances accuracy and efficiency in diverse regression problems.
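The role of a tunable regularization parameter in node-model selection can be illustrated with a small sketch. The exact definition of BIC_alpha is not given in this summary, so the criterion below (a fit term plus an alpha-scaled complexity penalty) is an assumption made for illustration, as are the function names and the restriction to two candidate models, a constant fit ("con") and a univariate linear fit ("lin").

```python
import math

def rss_const(y):
    """Residual sum of squares of the best constant fit (the mean)."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def rss_linear(x, y):
    """Residual sum of squares of the least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    if sxx == 0.0:  # constant feature: fall back to the constant fit
        return rss_const(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sxx
    a = my - b * mx
    return sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))

def bic_alpha(rss, n, dof, alpha):
    # Hypothetical criterion (not the paper's exact formula):
    # goodness of fit plus a complexity penalty scaled by alpha.
    return n * math.log(max(rss, 1e-12) / n) + alpha * dof * math.log(n)

def select_model(x, y, alpha):
    """Return the name of the candidate with the lowest criterion value."""
    n = len(y)
    scores = {
        "con": bic_alpha(rss_const(y), n, dof=1, alpha=alpha),
        "lin": bic_alpha(rss_linear(x, y), n, dof=2, alpha=alpha),
    }
    return min(scores, key=scores.get)

# Deterministic toy data: a clear linear trend and a flat response.
x = [i / 50 for i in range(50)]
y_linear = [2.0 * u + 0.1 * math.sin(7 * i) for i, u in enumerate(x)]
y_flat = [1.0 for _ in x]

print(select_model(x, y_linear, alpha=1.0))     # moderate penalty
print(select_model(x, y_linear, alpha=1000.0))  # very heavy penalty
print(select_model(x, y_flat, alpha=1.0))
```

In this sketch a larger alpha makes the more complex node model harder to select, which is one way such a parameter could shift the balance between constant and linear fits across the trees of a forest.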

Abstract

Random forests are widely used in regression. However, the decision trees used as base learners are poor approximators of linear relationships. To address this limitation we propose RaFFLE (Random Forest Featuring Linear Extensions), a novel framework that integrates the recently developed PILOT trees (Piecewise Linear Organic Trees) as base learners within a random forest ensemble. PILOT trees combine the computational efficiency of traditional decision trees with the flexibility of linear model trees. To ensure sufficient diversity of the individual trees, we introduce an adjustable regularization parameter and use node-level feature sampling. These modifications improve the accuracy of the forest. We establish theoretical guarantees for the consistency of RaFFLE under weak conditions, and its faster convergence when the data are generated by a linear model. Empirical evaluations on 136 regression datasets demonstrate that RaFFLE outperforms the classical CART and random forest methods, the regularized linear methods Lasso and Ridge, and the state-of-the-art XGBoost algorithm, across both linear and nonlinear datasets. By balancing predictive accuracy and computational efficiency, RaFFLE proves to be a versatile tool for tackling a wide variety of regression problems.
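The ensemble mechanics named in the abstract (bootstrap sampling, feature sampling, and averaging) can be sketched in a few lines. This is a heavily simplified illustration and not the authors' implementation: each "tree" here is a single node holding one univariate least-squares line, whereas a PILOT tree has many nodes and several model types. All names (`fit_line`, `fit_node`, `fit_forest`, `predict`) are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    if sxx == 0.0:  # degenerate sample: predict the mean
        return my, 0.0
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sxx
    return my - b * mx, b

def fit_node(X, y, features):
    """Among the sampled features, keep the line with the lowest RSS."""
    best = None
    for j in features:
        col = [row[j] for row in X]
        a, b = fit_line(col, y)
        rss = sum((v - (a + b * u)) ** 2 for u, v in zip(col, y))
        if best is None or rss < best[0]:
            best = (rss, j, a, b)
    return best[1], best[2], best[3]

def fit_forest(X, y, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(p), n_features)    # feature sampling at the node
        forest.append(fit_node(Xb, yb, feats))
    return forest

def predict(forest, row):
    """Average the predictions of all base learners."""
    return sum(a + b * row[j] for j, a, b in forest) / len(forest)

# Toy data: the response depends linearly on feature 0 only.
X = [[i / 20, (7 * i % 20) / 20, (13 * i % 20) / 20] for i in range(40)]
y = [3.0 * row[0] for row in X]

forest = fit_forest(X, y, n_trees=25, n_features=2, seed=1)
print(predict(forest, [0.5, 0.1, 0.9]))
```

Sampling fewer features than are available means some base learners never see the informative feature, which increases diversity at the cost of individual accuracy; the averaging step is what is meant to recover accuracy at the ensemble level.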

Paper Structure

This paper contains 12 sections, 8 theorems, 33 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Assuming the response variable is bounded in $[-B,B]$ and $R_{k-1}(T)>0$ in some node $T$, then the expected impurity gain of pcon on this node satisfies where $\hat{s}_j$ is the optimal splitting point of a pcon model fit to the observations in node $T$ using the $j$-th feature.
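To make the objects in the lemma concrete, the following sketch searches for the optimal splitting point of a piecewise constant (pcon) fit on a single feature. It assumes that impurity gain is measured as the reduction in the sum of squared errors, which is the usual CART-style convention; the lemma's own bound is not reproduced here, and the function names are illustrative.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_pcon_split(x, y):
    """Return (split_point, gain) for the best piecewise constant split.

    The gain is the parent SSE minus the summed SSE of the two children.
    This quadratic-time scan is for clarity only; practical
    implementations update running sums to do it in one pass.
    """
    pairs = sorted(zip(x, y))
    parent = sse([v for _, v in pairs])
    best_split, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        gain = parent - sse(left) - sse(right)
        if gain > best_gain:
            best_gain = gain
            best_split = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_split, best_gain

# A step function: the best split separates the two plateaus.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 10, 10, 10]
split, gain = best_pcon_split(x, y)
print(split, gain)
```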

Figures (6)

  • Figure 1: The five regression models used in PILOT: pcon, con, lin, blin and plin.
  • Figure 2: An example of a PILOT tree
  • Figure 3: Average $R^2$-scores over 5 runs on simulated linear data, for increasing training set size.
  • Figure 4: Boxplots of relative $R^2$-scores by method
  • Figure 5: Pairs plot of raw $R^2$ values of CART and the linear methods Lasso and Ridge
  • ...and 1 more figure

Theorems & Definitions (15)

  • Lemma 1
  • Lemma 2
  • Remark 1
  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Theorem 3
  • Theorem 4: Fast convergence on linear data
  • proof
  • ...and 5 more