Sparse Projection Oblique Randomer Forests

Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein

TL;DR

Sparse Projection Oblique Randomer Forests (SPORF) is a decision forest that uses very sparse random projections to generate oblique splits, preserving the benefits of axis-aligned forests (robustness, interpretability, and efficiency) while improving accuracy over existing oblique methods. Across extensive simulations and 105 real-world benchmark datasets, SPORF demonstrates strong, consistent performance, robustness to hyperparameters and high-dimensional noise, and scaling comparable to Random Forests. Theoretical analyses and empirical results show that SPORF achieves competitive time and space complexity, with practical implementations in R and Python that enable parallelization and fast inference, including a Forest Packing acceleration for prediction. Overall, SPORF provides a scalable, interpretable, and accurate alternative to axis-aligned RF and existing oblique forests, with strong potential for integration into boosting and other ensemble frameworks.

Abstract

Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortunately, these extensions forfeit one or more of the favorable properties of decision forests based on axis-aligned splits, such as robustness to many noise dimensions, interpretability, or computational efficiency. We introduce yet another decision forest, called "Sparse Projection Oblique Randomer Forests" (SPORF). SPORF uses very sparse random projections, i.e., linear combinations of a small subset of features. SPORF significantly improves accuracy over existing state-of-the-art algorithms on a standard benchmark suite for classification with >100 problems of varying dimension, sample size, and number of classes. To illustrate how SPORF addresses the limitations of both axis-aligned and existing oblique decision forest methods, we conduct extensive simulated experiments. SPORF typically yields improved performance over existing decision forests, while mitigating computational efficiency and scalability concerns and maintaining interpretability. SPORF can easily be incorporated into other ensemble methods such as boosting to obtain potentially similar gains.
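
To make the idea of a "very sparse random projection" concrete, the following is a minimal sketch of how one split node might choose an oblique split. It is an illustration under stated assumptions, not the authors' implementation: the nonzero weights are assumed to be +1 or -1, the `density` parameter, the Gini criterion, the exhaustive threshold search, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def sample_sparse_projections(p, d, density, rng):
    """Return a p x d matrix of candidate directions.

    Assumption for this sketch: each entry is nonzero with probability
    `density`, and nonzero entries are +1 or -1 with equal probability.
    """
    mask = rng.random((p, d)) < density
    signs = rng.choice([-1.0, 1.0], size=(p, d))
    A = np.where(mask, signs, 0.0)
    # Make sure every candidate direction uses at least one feature.
    empty = np.flatnonzero(~mask.any(axis=0))
    if empty.size:
        rows = rng.integers(0, p, size=empty.size)
        A[rows, empty] = rng.choice([-1.0, 1.0], size=empty.size)
    return A

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    q = counts / counts.sum()
    return 1.0 - np.sum(q ** 2)

def best_oblique_split(X, y, d, density, rng):
    """Search d sparse projections for the split with lowest weighted Gini."""
    n, p = X.shape
    A = sample_sparse_projections(p, d, density, rng)
    Z = X @ A  # each column is a linear combination of a few features
    best = (np.inf, None, None)
    for j in range(d):
        values = np.unique(Z[:, j])
        # Candidate thresholds are midpoints between adjacent projected values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = Z[:, j] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best[0]:
                best = (score, A[:, j].copy(), t)
    return best  # (weighted impurity, projection weights, threshold)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The class boundary is oblique: it depends on the sum of features 0 and 1.
y = (X[:, 0] + X[:, 1] > 0).astype(int)
score, weights, threshold = best_oblique_split(X, y, d=25, density=0.3, rng=rng)
print(score, weights, threshold)
```

Because each candidate direction touches only a few features, the chosen split stays readable as a short signed sum of feature values, and an axis-aligned split is the special case in which a direction has exactly one nonzero weight.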

Paper Structure

This paper contains 34 sections, 6 equations, 12 figures, 2 algorithms.

Figures (12)

  • Figure 1: Classification performance on the consistency ($p = 2$) problem as a function of the number of training samples. The consistency problem is designed such that RF has a theoretical lower bound of error of 1/6. (A) The joint distribution of $(X, Y)$. $X$ is uniformly distributed in the three unit squares. The lower left and upper right squares have countably infinite stripes (a finite number of stripes are shown), and the center square is a $2 \times 2$ checkerboard. The white areas represent $f(X) = 0$ and gray areas represent $f(X) = 1$. (B) Error rate as a function of $n$. The dashed line represents the lower bound of error for RF, which is 1/6. Sporf and other oblique methods achieve an error rate dramatically lower than the lower bound for RF.
  • Figure 1: Pairwise comparisons of RF with Sporf, XGBoost, RR-RF, and CCF on the numeric datasets (top), categorical datasets (middle), and all datasets (numeric and categorical combined; bottom) from the UCI Machine Learning Repository (105 datasets total). (A) Beeswarm plots showing the distributions of classification performance relative to RF for various decision forest algorithms. Classification performance is measured by effect size, which is defined as $\kappa(\texttt{RF}) - \kappa(\mathcal{A})$, where $\kappa$ is Cohen's kappa and $\mathcal{A}$ is one of the algorithms compared to RF. Each point corresponds to a particular data set. Mean effect sizes are indicated by a black "x." A negative value on the y-axis indicates RF performed worse than a particular algorithm. (B) Histograms of the relative ranks of the different algorithms, where a rank of 1 indicates best relative classification performance and 5 indicates worst. Color indicates frequency, as fraction of data sets. P-values correspond to testing that RF, XGBoost, RR-RF, and CCF performed worse than Sporf, using one-sided Wilcoxon signed-rank tests. Overall, Sporf tends to perform better than the other algorithms.
  • Figure 1: Comparison of training times of RF, Sporf, and F-RC on the 20-dimensional sparse parity setting. (A) Dependency of training time using the best set of hyperparameters (y-axis) on the number of training samples (x-axis) for the sparse parity problem. (B) Dependency of training time (y-axis) on the number of projections sampled at each split node (x-axis) for the sparse parity problem with $n = 5000$. (C) Dependency of error rate (y-axis) on the number of projections sampled at each split node (x-axis) for the sparse parity problem with $n = 5000$. Sporf and F-RC can sample many more than $p$ projections, unlike RF. As seen in panels (B) and (C), increasing $d$ above $p$ meaningfully improves classification performance at the expense of larger training times. However, comparing error rates and training times at $d = 20$, Sporf can classify substantially better than RF even with no additional cost in training time.
  • Figure 1: Comparison of tree strength and correlation of Sporf, RF, and F-RC on four of the simulated datasets: (A) sparse parity with $p = 10, n = 1000$, (B) orthant with $p = 6, n = 400$, (C) Trunk with $p = 10, n = 10$, and (D) Trunk with $p = 10, n = 100$. For a particular algorithm, there are ten dots, each corresponding to one of ten trials. Note that in all settings, Sporf beats RF and/or F-RC. However, the mechanism by which it does so varies across the different settings. In sparse parity, Sporf wins because the trees are substantially stronger, even though the correlation increases. In Trunk with the small sample size, it wins purely because of less correlated trees. However, when the sample size increases 10-fold, it wins purely because of stronger trees. This suggests that Sporf can effectively trade off strength for correlation on the basis of sample complexity to empirically outperform RF and F-RC.
  • Figure 1: (A-D) Bias, variance, variance effect, and error rate, respectively, on the sparse parity problem as a function of the number of training samples. Error rate is the sum of systematic effect and variance effect, which roughly measure the contributions of bias and variance to the error rate, respectively. In this example, bias and systematic effect are identical because the Bayes error is zero (see James, 2003). For smaller training sets, Sporf wins primarily through lower bias/systematic effect, while for larger training sets it wins primarily through lower variance effect.
  • ...and 7 more figures
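
The effect size used in the pairwise-comparison caption above, $\kappa(\texttt{RF}) - \kappa(\mathcal{A})$, can be sketched as follows. This is a hedged illustration that assumes the standard definition of Cohen's kappa (observed agreement corrected for chance agreement); the paper's exact computation may differ, and the function names and toy labels are hypothetical.

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Standard Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement and p_e the agreement expected by chance
    from the marginal label frequencies. Undefined when p_e equals 1
    (for example, when only one class is present).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return (p_o - p_e) / (1.0 - p_e)

def effect_size(y_true, pred_rf, pred_other):
    """kappa(RF) - kappa(A); negative values mean RF did worse than A."""
    return cohen_kappa(y_true, pred_rf) - cohen_kappa(y_true, pred_other)

# Toy labels (made up for illustration, not results from the paper).
y_true = [0, 0, 1, 1, 2, 2]
pred_rf = [0, 0, 1, 2, 2, 1]      # four of six correct
pred_other = [0, 0, 1, 1, 2, 2]   # all correct
effect = effect_size(y_true, pred_rf, pred_other)
print(effect)  # -0.5: RF has kappa 0.5, the other algorithm has kappa 1.0
```

In the toy example the chance agreement is 1/3 for both predictors, so RF's kappa is (2/3 - 1/3) / (2/3) = 0.5 and the effect size is 0.5 - 1.0 = -0.5, matching the caption's convention that negative values favor the compared algorithm.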