Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability

Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu

TL;DR

This work presents RF+ and MDI+ as a unified framework that blends random forests with generalized linear models by reinterpreting decision trees as linear models on engineered split-based features. It shows that MDIs can be understood as R^2 values of partial models, enabling a stable, regularized, and flexible approach to feature importance, while RF+ improves predictive performance via augmented features and GLMs. The PCS-inspired model selection and aggregation strategies guide practitioners toward robust, problem-tailored choices, improving both prediction and interpretation. Case studies in drug response prediction and breast cancer subtyping demonstrate improved predictive accuracy and greater stability of gene-importance rankings, highlighting the practical impact for high-stakes biomedical problems. Overall, RF+ and MDI+ offer a principled path to interpretable, reliable insights from complex tabular data while enhancing predictive power.

Abstract

Random forests (RFs) are among the most popular supervised learning algorithms due to their nonlinear flexibility and ease-of-use. However, as black box models, they can only be interpreted via algorithmically-defined feature importance methods, such as Mean Decrease in Impurity (MDI), which have been observed to be highly unstable and have ambiguous scientific meaning. Furthermore, they can perform poorly in the presence of smooth or additive structure. To address this, we reinterpret decision trees and MDI as linear regression and $R^2$ values, respectively, with respect to engineered features associated with the tree's decision splits. This allows us to combine the respective strengths of RFs and generalized linear models in a framework called RF+, which also yields an improved feature importance method we call MDI+. Through extensive data-inspired simulations and real-world datasets, we show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features, often yielding more than a 10% improvement over its closest competitor. In case studies on drug response prediction and breast cancer subtyping, we further show that MDI+ extracts well-established genes with significantly greater stability compared to existing feature importance measures.

Paper Structure

This paper contains 93 sections, 3 theorems, 22 equations, 29 figures, and 4 tables.

Key Result

Proposition 1

Let $\hat{f}$ denote the CART model obtained from splits $\mathcal{S}$ and data $\mathcal{D}_{n}$. Let $\hat{\boldsymbol{\beta}} \coloneqq (\hat{\beta}_1,\ldots,\hat{\beta}_p)$ be the OLS coefficients obtained when regressing $\mathbf{y}$ on $\Psi\left(\mathbf{X};\mathcal{S},\mathcal{D}_{n}\right)$,
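
The reinterpretation named in the abstract (a fitted tree viewed as linear regression on split-based features) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper `stump_features` and its normalized left-versus-right encoding are assumptions introduced here for illustration, and the synthetic data is made up. Under these assumptions, ordinary least squares with an intercept on the stump features should reproduce the tree's fitted values on the training data, because the intercept and the stump columns together span the same space as the leaf indicators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


def stump_features(tree, X):
    """Hypothetical helper: one column per internal node of a fitted tree.

    Each column is a normalized contrast between the node's left and right
    children (an assumed encoding, chosen so columns sum to zero on the
    training data). Also returns the index of the variable each node splits on.
    """
    t = tree.tree_
    reach = tree.decision_path(X).toarray().astype(bool)  # (n_samples, n_nodes)
    cols, split_vars = [], []
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: it carries no split
            continue
        in_l, in_r = reach[:, left], reach[:, right]
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        cols.append((n_r * in_l - n_l * in_r) / np.sqrt(n_l * n_r))
        split_vars.append(t.feature[node])
    return np.column_stack(cols), np.array(split_vars)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + np.where(X[:, 1] > 0, 2.0, 0.0) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
Psi, split_vars = stump_features(tree, X)

# OLS on the stump features (with intercept) versus the tree's own predictions.
ols = LinearRegression().fit(Psi, y)
max_gap = np.max(np.abs(ols.predict(Psi) - tree.predict(X)))
print(f"largest gap between OLS and tree fitted values: {max_gap:.2e}")
```

The agreement holds for any encoding whose columns, together with a constant, span the leaf indicators, so the particular normalization above is a convenience rather than a requirement.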

Figures (29)

  • Figure 1: Overview of MDI+ for a single tree. For each tree $\mathcal{S}$ in the random forest, Step 1: Obtain the transformed dataset on the in- and out-of-bag samples using stumps from the tree and append the raw and/or any additional (possibly engineered) features. Step 2: Fit a regularized GLM. Step 3: Using the fitted GLM, make partial model predictions $\hat{\mathbf{y}}^{(k)}$ for each feature $k = 1, \ldots, p$ (stacked boxes) using a leave-one-out (LOO) data splitting scheme. Step 4: For each $k = 1, \ldots, p$, evaluate partial model predictions via any user-defined similarity metric to obtain the MDI+ for feature $k$ in tree $\mathcal{S}$.
  • Figure 2: Relative performance of RF+ (ridge) as compared to RF in both the (A) regression and (B) classification settings. In the regression setting, RF+ (ridge) increases performance by 4.4% on average across all 18 drugs that have test set $R^2 > 0.1$. In the classification setting, RF+ (logistic) increases performance according to the $F1$ score for three of the four datasets, and by $3.6\%$ on average. RF+ (logistic) increases AUPRC for all datasets, and by $2.3\%$ on average.
  • Figure 3: MDI+ outperforms all other feature importance methods for the data-inspired regression simulations described in Section \ref{subsec:regression_results} using the Splicing dataset. This pattern is evident across various regression functions (specified by row), proportions of variance explained (specified by column), and sample sizes (on the $x$-axis). In all subplots, the AUROC has been averaged across 50 experimental replicates, and error bars represent $\pm$ 1SE.
  • Figure 4: Both MDI+ (ridge) and MDI+ (logistic) outperform all other feature importance methods for the data-inspired classification simulations described in Section \ref{subsec:classification_results} using the CCLE RNASeq dataset. Furthermore, MDI+ (logistic) slightly outperforms MDI+ (ridge) in the majority of settings, indicating the benefit of tailoring the choices of MDI+ to the data at hand. This pattern is evident across various regression functions (specified by row), proportions of corrupted labels (specified by column), and sample sizes (on the $x$-axis). In all subplots, the AUROC has been averaged across 50 experimental replicates, and error bars represent $\pm$ 1SE.
  • Figure 5: Under the LSS with outliers regression setting using the Enhancer dataset (described in Section \ref{subsec:robust_results}), the performance of MDI+ (Huber) suffers far less than that of other methods, including MDI+ (ridge), as the level of corruption $\mu_{corrupt}$ (specified by row) and the proportion of outliers (specified by column) grow. This pattern also holds across sample sizes (on the $x$-axis). In all subplots, the AUROC has been averaged across 50 experimental replicates, and error bars represent $\pm$ 1SE.
  • ...and 24 more figures
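
The four steps in the Figure 1 caption can be sketched for a single tree as follows. This is a simplified illustration under stated assumptions, not the authors' code: the helper names `stump_features` and `mdi_plus_single_tree` are hypothetical, the tree is fit on all samples rather than a bootstrap sample (so the in-bag/out-of-bag distinction is dropped), ridge regression stands in for the regularized GLM, leave-one-out fits are done by brute-force refitting, $R^2$ is used as the similarity metric, and partial predictions for feature $k$ are formed by holding every column not tied to feature $k$ at its training mean (an assumption, since the caption does not say how the other features are treated).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor


def stump_features(tree, X):
    """Hypothetical helper: one normalized contrast column per internal node."""
    t = tree.tree_
    reach = tree.decision_path(X).toarray().astype(bool)  # (n_samples, n_nodes)
    cols, split_vars = [], []
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: it carries no split
            continue
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        cols.append((n_r * reach[:, left] - n_l * reach[:, right]) / np.sqrt(n_l * n_r))
        split_vars.append(t.feature[node])
    return np.column_stack(cols), np.array(split_vars)


def mdi_plus_single_tree(X, y, max_depth=3, alpha=1.0):
    """Sketch of Steps 1-4 for one tree; returns one score per raw feature."""
    n, p = X.shape
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, y)
    Psi, split_vars = stump_features(tree, X)

    # Step 1: transformed dataset = stump features with the raw features appended.
    Z = np.hstack([Psi, X])
    owner = np.concatenate([split_vars, np.arange(p)])  # raw feature behind each column

    # Steps 2-3: regularized GLM (ridge here) refit with sample i left out,
    # then a partial prediction for sample i from each feature's block of columns.
    partial = np.zeros((n, p))
    for i in range(n):
        train = np.arange(n) != i
        glm = Ridge(alpha=alpha).fit(Z[train], y[train])
        z_bar = Z[train].mean(axis=0)
        for k in range(p):
            # Assumption: columns not owned by feature k are held at their mean.
            z = np.where(owner == k, Z[i], z_bar)
            partial[i, k] = glm.intercept_ + z @ glm.coef_

    # Step 4: score each feature's partial predictions with a similarity metric.
    return np.array([r2_score(y, partial[:, k]) for k in range(p)])


# Synthetic data, for illustration only: features 0 and 1 carry signal, 2 and 3 do not.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + 3.0 * (X[:, 1] > 0) + rng.normal(scale=0.25, size=120)

scores = mdi_plus_single_tree(X, y)
print(np.round(scores, 3))
```

In a full forest these per-tree scores would be computed for every tree and then aggregated; the brute-force leave-one-out loop is used here only to keep the sketch short and easy to follow.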

Theorems & Definitions (4)

  • Proposition 1
  • Remark 2
  • Theorem 3
  • Proposition 4