Cross-Cluster Weighted Forests
Maya Ramchandran, Rajarshi Mukherjee, Giovanni Parmigiani
TL;DR
This work addresses heterogeneity in training data by partitioning the data into estimated clusters, training one Random Forest per cluster, and combining the forests into an ensemble whose weights are learned via stacked regression to emphasize cross-cluster generalizability. The Cross-Cluster Weighted Forest (CCWF) framework is analyzed theoretically under a high-dimensional linear model, which identifies bias reduction as the primary mechanism for improvement over a single forest trained on the merged data; the upper bounds indicate substantial gains as the number of clusters grows. Extensive simulations in two-cluster and multi-cluster settings, together with robustness checks and a real genomics application (LGG), show that CCWF consistently outperforms the traditional Random Forest trained on the merged data, particularly when the clusters capture heterogeneity in the feature distribution. The results highlight the importance of the data partitioning strategy and the ensemble weighting scheme, and they suggest CCWF as a practical, scalable approach for handling batch effects and cluster-structured data in biological and biomedical contexts.
Abstract
Adapting machine learning algorithms to better handle clusters or batch effects within training datasets is important across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a dataset whose feature distribution is heterogeneous. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We begin with a theoretical exploration of the benefits of our novel approach, which we call the Cross-Cluster Weighted Forest, and then empirically examine its robustness to various data-generating scenarios and outcome models. Furthermore, we explore how the data partitioning and ensemble weighting strategies influence the benefits of our method over the existing paradigm. Finally, we apply our approach to cancer molecular profiling and gene expression datasets that are naturally divisible into clusters and illustrate that it outperforms the classic Random Forest.
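
The pipeline described above (cluster the training data, fit one forest per cluster, then weight the forests by stacked regression) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `fit_ccwf` and `predict_ccwf`, the use of scikit-learn and SciPy, the choice of non-negative least squares as the stacking regression, and the synthetic data are all assumptions made here for concreteness.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def fit_ccwf(X, y, n_clusters=3, n_trees=50, seed=0):
    """Fit one forest per k-means cluster and learn stacking weights (hypothetical helper)."""
    # Step 1: partition the training data into estimated clusters.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)

    # Step 2: train a separate Random Forest on each cluster.
    forests = []
    for k in range(n_clusters):
        mask = labels == k
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(X[mask], y[mask])
        forests.append(rf)

    # Step 3: stack. Column k holds forest k's predictions on ALL training
    # points, so most rows of each column are out-of-cluster predictions.
    P = np.column_stack([rf.predict(X) for rf in forests])

    # Assumption: non-negative least squares as the stacking regression.
    weights, _ = nnls(P, y)
    return forests, weights


def predict_ccwf(forests, weights, X_new):
    """Weighted combination of the cluster-specific forest predictions."""
    P_new = np.column_stack([rf.predict(X_new) for rf in forests])
    return P_new @ weights


# Synthetic example: three well-separated feature clusters, one shared outcome model.
rng = np.random.default_rng(0)
centers = np.array([[-4.0] * 5, [0.0] * 5, [4.0] * 5])
X = np.vstack([rng.normal(c, 1.0, size=(100, 5)) for c in centers])
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=300)

forests, weights = fit_ccwf(X, y, n_clusters=3)
preds = predict_ccwf(forests, weights, X[:5])
```

Because each forest is scored on every training point, a forest earns weight largely through how well it predicts observations outside its own cluster, which is the sense in which the weights reward cross-cluster generalizability. One simplification to keep in mind: each forest's predictions on its own cluster are in-sample here, so a more careful version might use out-of-bag predictions for those rows.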
