Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability
Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu
TL;DR
This work presents RF+ and MDI+ as a unified framework that blends random forests with generalized linear models by reinterpreting decision trees as linear models on engineered split-based features. It shows that MDI scores can be understood as $R^2$ values of partial models, which enables a stable, regularized, and flexible approach to feature importance, while RF+ improves predictive performance via augmented features and GLMs. The PCS-inspired model selection and aggregation strategies guide practitioners toward robust, problem-tailored choices, improving both prediction and interpretation. Case studies in drug response prediction and breast cancer subtyping demonstrate improved predictive accuracy and greater stability of gene-importance rankings, highlighting the practical impact for high-stakes biomedical problems. Overall, RF+ and MDI+ offer a principled path to interpretable, reliable insights from complex tabular data while enhancing predictive power.
Abstract
Random forests (RFs) are among the most popular supervised learning algorithms due to their nonlinear flexibility and ease of use. However, as black box models, they can only be interpreted via algorithmically defined feature importance methods, such as Mean Decrease in Impurity (MDI), which have been observed to be highly unstable and have ambiguous scientific meaning. Furthermore, RFs can perform poorly in the presence of smooth or additive structure. To address these issues, we reinterpret decision trees and MDI as linear regression and $R^2$ values, respectively, with respect to engineered features associated with the tree's decision splits. This allows us to combine the respective strengths of RFs and generalized linear models in a framework called RF+, which also yields an improved feature importance method we call MDI+. Through extensive data-inspired simulations and real-world datasets, we show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features, often yielding more than a 10% improvement over its closest competitor. In case studies on drug response prediction and breast cancer subtyping, we further show that MDI+ extracts well-established genes with significantly greater stability compared to existing feature importance measures.
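
To make the reinterpretation concrete, the following is a minimal sketch for a single scikit-learn regression tree, not the RF+ or MDI+ procedure itself. It builds one engineered feature per split, fits ordinary least squares on those features, and compares the result with the tree's predictions and its MDI scores. The helper name `stump_features`, the synthetic data, and the particular scaling of each split feature (chosen so the columns are mutually orthogonal on the training data) are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def stump_features(tree, X):
    """Build one feature per internal node of a fitted tree (hypothetical helper).

    Each column is positive on the node's left child, negative on its right
    child, and zero outside the node. The scaling by training sample counts is
    an assumption of this sketch.
    """
    t = tree.tree_
    # member[i, node] is 1.0 if sample i passes through that node
    member = tree.decision_path(X).toarray().astype(float)
    internal = np.flatnonzero(t.children_left != -1)
    cols = []
    for node in internal:
        left, right = t.children_left[node], t.children_right[node]
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        cols.append(
            (n_r * member[:, left] - n_l * member[:, right]) / np.sqrt(n_l * n_r)
        )
    # Also return the index of the original feature each node splits on
    return np.column_stack(cols), t.feature[internal]


# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=300)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
Psi, split_feature = stump_features(tree, X)

# Tree as linear regression: OLS with an intercept on the split features
A = np.column_stack([np.ones(len(y)), Psi])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
ols_pred = A @ beta

# MDI as R^2: for each original feature, regress the centered response on
# only the split features from nodes that split on that feature
y_centered = y - y.mean()
tss = np.sum(y_centered ** 2)
partial_r2 = np.zeros(X.shape[1])
for k in range(X.shape[1]):
    Psi_k = Psi[:, split_feature == k]
    if Psi_k.shape[1] == 0:
        continue  # feature k is never split on, so its partial R^2 stays 0
    coef, *_ = np.linalg.lstsq(Psi_k, y_centered, rcond=None)
    partial_r2[k] = np.sum((Psi_k @ coef) ** 2) / tss

print("max |OLS - tree| on training data:", np.abs(ols_pred - tree.predict(X)).max())
print("partial R^2 (renormalized):", partial_r2 / partial_r2.sum())
print("sklearn MDI:               ", tree.feature_importances_)
```

On the training data, the least-squares fit should coincide with the tree's predictions, and each feature's partial $R^2$ should match scikit-learn's `feature_importances_` after rescaling to sum to one. The abstract's RF+ and MDI+ build on this view; how they generalize it is not shown in this sketch.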
