Cross-Cluster Weighted Forests

Maya Ramchandran, Rajarshi Mukherjee, Giovanni Parmigiani

TL;DR

This work addresses heterogeneity in training data by partitioning the data into estimated clusters, training one Random Forest on each cluster, and combining the forests in an ensemble whose weights are learned via stacked regression to emphasize cross-cluster generalizability. The Cross-Cluster Weighted Forest (CCWF) framework is analyzed theoretically under a high-dimensional linear model, showing bias reduction as the primary mechanism for improvement over a single merged forest; upper bounds indicate substantial gains as the number of clusters grows. Extensive simulations across two-cluster and multi-cluster settings, plus robustness checks and a real genomics application (LGG), demonstrate that CCWF consistently outperforms traditional Random Forests and simple merging, particularly when clusters capture heterogeneity in the feature distribution. The results highlight the importance of the data partitioning strategy and the ensemble weighting, and they suggest CCWF as a practical, scalable approach for managing batch effects and cluster-structured data in biological and biomedical contexts.

Abstract

Adapting machine learning algorithms to better handle the presence of clusters or batch effects within training datasets is important across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We begin with a theoretical exploration of the benefits of our novel approach, denoted as the Cross-Cluster Weighted Forest, and subsequently empirically examine its robustness to various data-generating scenarios and outcome models. Furthermore, we explore the influence of the data partitioning and ensemble weighting strategies on the benefits of our method over the existing paradigm. Finally, we apply our approach to cancer molecular profiling and gene expression datasets that are naturally divisible into clusters and illustrate that our approach outperforms classic Random Forest.
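The pipeline described above (cluster the features, fit one forest per cluster, learn ensemble weights by stacking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the use of non-negative least squares as the stacking step are all choices made here for illustration, and the paper's stacked regression may be regularized or constrained differently.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def fit_cluster_forests(X, y, k, n_trees=25, seed=0):
    """Partition X with k-means, fit one forest per cluster, then stack."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    forests = [
        RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        .fit(X[labels == c], y[labels == c])
        for c in range(k)
    ]
    # Every forest predicts on the full training set, so its weight reflects
    # accuracy both inside and outside the cluster it was trained on.
    P = np.column_stack([f.predict(X) for f in forests])
    # Assumption: non-negative least squares stands in for stacked regression.
    weights, _ = nnls(P, y)
    return forests, weights


def predict_cluster_forests(forests, weights, X_new):
    """Weighted combination of the per-cluster forest predictions."""
    P = np.column_stack([f.predict(X_new) for f in forests])
    return P @ weights


# Toy data (hypothetical): three shifted Gaussian feature clusters, linear outcome.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(100, 5)) for m in (-3.0, 0.0, 3.0)])
y = X @ np.arange(1.0, 6.0) + rng.normal(size=300)

forests, weights = fit_cluster_forests(X, y, k=3)
preds = predict_cluster_forests(forests, weights, X[:20])
```

Fitting the weights on predictions over all training points, instead of only each forest's own cluster, is what lets the weighting step favor forests that transfer across clusters.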

Paper Structure

This paper contains 20 sections, 4 theorems, 19 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Given a dataset with two equally sized training clusters $\mathbb{X}_1 \stackrel{\rm i.i.d.}{\sim} [\mathrm{U}(0, 1/2)]^p$ and $\mathbb{X}_2 \stackrel{\rm i.i.d.}{\sim} [\mathrm{U}(1/2, 1)]^p$, we derive the following performance upper bounds for Merged and Ensemble. The total number of splits in ea
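The two-cluster feature design in Theorem 1 can be simulated directly. The sketch below draws the two uniform clusters and checks how well k-means with two centers recovers them; the sample size, the dimension `p`, and the seeds are hypothetical values chosen for illustration, and the outcome model and the bounds themselves are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_per_cluster, p = 200, 10  # hypothetical sizes, not taken from the paper

# Two equally sized clusters with disjoint support, as in the theorem statement.
X1 = rng.uniform(0.0, 0.5, size=(n_per_cluster, p))  # rows ~ [U(0, 1/2)]^p
X2 = rng.uniform(0.5, 1.0, size=(n_per_cluster, p))  # rows ~ [U(1/2, 1)]^p
X = np.vstack([X1, X2])
true_labels = np.repeat([0, 1], n_per_cluster)

# Estimate the partition from the features alone.
est = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cluster ids are arbitrary, so score agreement up to label swapping.
agreement = max(np.mean(est == true_labels), np.mean(est != true_labels))
print(f"k-means agreement with the true clusters: {agreement:.3f}")
```

Because the two supports are separated in every coordinate, the estimated partition should track the true one closely in this design, which is the regime in which training per-cluster forests on estimated clusters is most natural.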

Figures (7)

  • Figure 1: Average RMSE of the Merged and the Ensemble as a function of the number of clusters in the training set. (A) Uniform clusters (B) Multivariate Gaussian clusters (C) Multivariate Laplace-distributed clusters
  • Figure 2: Squared bias, variance, and MSE of the Merged and Multi learners (color coded) for datasets ranging from 500 to 5000 total samples (100 to 1000 per cluster), generated using the Gaussian cluster framework. Each panel corresponds to a sample size.
  • Figure 3: Percent change in average RMSE of ensembling approaches (color labeled) compared to the Merged across different data-generating scenarios, as a function of $k$. The first row depicts results using the non-Gaussian cluster simulation approach, while the second row uses a Gaussian data-generating model. (A.1-A.2) A linear model was used to generate the outcome from the covariates. (B.1-B.2) The binary outcome was created by applying a cutoff to the linear model, yielding a binary step function. (C.1-C.2) Quadratic terms for two of the variables were added to the linear outcome-generating model.
  • Figure 4: Percent change in average RMSE of ensembling approaches (color labeled) compared to the Merged across different data-generating scenarios. All simulations used a quadratic outcome and the non-Gaussian cluster-generating algorithm. (a) Varying the magnitude of the coefficients in the outcome-generating model to determine the effect of signal strength on prediction accuracy gains. (b) Varying the number of true clusters within the training set, while keeping the total sample size constant at 2500. (c) Varying the sample size per cluster, while keeping the total number of clusters per dataset constant at 5.
  • Figure 5: Distribution of the ensemble weights determined by stacked regression for (a) Cluster, (b) Random, and (c) Multi, with $k$ = 20 and $k$ = 80 for the first two methods and 5 true clusters for the latter. We used the Gaussian cluster-generation framework. The distribution of the largest weight per ensemble is depicted in green, while the rest of the weights are visualized in purple. Results are shown over 100 iterations at each value of $k$.
  • ...and 2 more figures

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Remark 3
  • proof
  • Remark 4
  • proof
  • Lemma 2
  • proof