RO-FIGS: Efficient and Expressive Tree-Based Ensembles for Tabular Data

Urška Matjašec, Nikola Simidjievski, Mateja Jamnik

TL;DR

RO-FIGS tackles the expressiveness bottleneck of traditional tree ensembles on tabular data by introducing oblique (multivariate) splits, each a linear combination learnt from a random subset of features, within an additive, boosting-like framework. It builds compact ensembles (often fewer than five trees) that capture feature interactions more efficiently than univariate splits, while remaining interpretable through sparse oblique splits whose analysis complements SHAP summary plots. On 22 real-world datasets, RO-FIGS achieves leading performance with smaller models; a Friedman test followed by a Bonferroni-Dunn post-hoc test ($p < 0.05$) shows it is significantly better than FIGS and five other baselines, and ablation studies highlight the value of oblique splits and of the minimum-impurity-decrease stopping rule. The work offers a practical, interpretable approach to balancing accuracy and model simplicity in real-world tabular data tasks, and an implementation is available.

Abstract

Tree-based models are often robust to uninformative features and can accurately capture non-smooth, complex decision boundaries. Consequently, they often outperform neural network-based models on tabular datasets at a significantly lower computational cost. Nevertheless, the capability of traditional tree-based ensembles to express complex relationships efficiently is limited because each split uses a single feature. To improve the efficiency and expressiveness of tree-based methods, we propose Random Oblique Fast Interpretable Greedy-Tree Sums (RO-FIGS). RO-FIGS builds on Fast Interpretable Greedy-Tree Sums and extends it by learning trees with oblique (multivariate) splits, where each split is a linear combination learnt from a random subset of features. This helps uncover interactions between features and improves performance. The proposed method is suitable for tabular datasets with both numerical and categorical features. We evaluate RO-FIGS on 22 real-world tabular datasets, demonstrating better performance and much smaller models than other tree- and neural network-based methods. Additionally, we analyse the splits of RO-FIGS models to reveal valuable insights into feature interactions, enriching the information learnt from SHAP summary plots and thereby demonstrating the enhanced interpretability of these models. The proposed method is well suited to applications where a balance between accuracy and interpretability is essential.
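
The idea of an oblique split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `gini` and `oblique_split` are hypothetical, the weights are simply the difference of class means on a random feature subset (a stand-in for whatever optimiser RO-FIGS actually uses), and the threshold is chosen by a plain scan that minimises the weighted Gini impurity. It assumes binary 0/1 labels with both classes present.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def oblique_split(X, y, n_features, rng):
    """Return (feature_idx, weights, threshold) for one oblique split.

    Hypothetical helper: the weights are the difference of class means,
    an assumption made for illustration only.
    """
    # Randomly choose which features take part in this split.
    idx = sorted(rng.sample(range(len(X[0])), n_features))
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    weights = [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in idx
    ]
    # Project every sample onto the learnt linear combination.
    proj = [sum(w * row[j] for w, j in zip(weights, idx)) for row in X]
    # Scan candidate thresholds; keep the one with the lowest weighted impurity.
    best_t, best_imp = None, float("inf")
    for t in sorted(set(proj))[:-1]:
        left = [label for p, label in zip(proj, y) if p <= t]
        right = [label for p, label in zip(proj, y) if p > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return idx, weights, best_t
```

A sample `x` is then routed to the right child when the weighted sum of its selected features exceeds the threshold. A univariate split is the special case in which only one feature is selected, which is why a boundary such as `x0 + x1 > 0` needs many axis-aligned splits but only one oblique split.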

Paper Structure

This paper contains 27 sections, 1 equation, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Fitting process of RO-FIGS on a toy example. RO-FIGS adds one split to the model in each iteration. Splits are oblique and computed as linear combinations $\Phi()$ of multiple randomly selected features $F$. $t$ denotes the threshold value for splitting. The figure has been adapted from Tan et al. (2022).
  • Figure 2: Average ranks of model performance for RO-FIGS and the baselines, compared using the Friedman statistical test followed by the Bonferroni-Dunn post-hoc test ($p < 0.05$). RO-FIGS ranks best (higher values are better), and is statistically significantly better (outside the $\mathit{CD}$ interval) than FIGS and five other baseline methods: MT, ETC, DT, ODT and OT.
  • Figure 3: Left and middle: Comparison between the performance and the number of splits and trees in tree-based models. RO-FIGS models achieve the best balance, as they are well-performing and small in size. Right: Comparison between the performance and training time. RO-FIGS offers a good balance between performance and computational cost. Performance is calculated as mean normalised accuracy for each method, over all datasets, with error bars spanning the 20th/80th percentile over all datasets. Note the logarithmic scale on the x-axes and the omission of single-tree methods in the middle figure.
  • Figure 4: SHAP summary plot of RO-FIGS on the diabetes dataset (fold 0). As with the baselines, the glucose, age, and BMI features contribute the most to the model. RO-FIGS, with an accuracy of 73.6% (see the performance table), outperforms all baselines on this dataset with only one split, demonstrating that a linear combination of the glucose, age, and BMI features is optimal.
  • Figure 5: Comparison of fitting processes of FIGS and RO-FIGS. In each iteration, both methods add one split to the model. However, RO-FIGS computes a linear combination of multiple randomly selected features, while FIGS builds splits with one feature. $t$, $\Phi$, $F$ denote thresholds, functions, and features, respectively. The figure has been adapted from Tan et al. (2022).
  • ...and 3 more figures
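
The rank comparison described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, ranks are assigned so that the highest accuracy gets the highest rank (an assumption made to match the caption's "higher values are better"), and the critical value `q_alpha` is taken as an input because it must be looked up in a published table for the chosen significance level and number of methods. The critical difference uses the standard Bonferroni-Dunn form $\mathit{CD} = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ datasets.

```python
import math


def average_ranks(scores):
    """scores[d][m] is the accuracy of method m on dataset d.

    Returns one average rank per method; tied methods share the mean rank.
    Assumption: the highest accuracy receives the highest rank.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # Method indices ordered from worst to best on this dataset.
        order = sorted(range(n_methods), key=lambda m: row[m])
        i = 0
        while i < n_methods:
            # Extend j over a run of tied scores.
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2.0 + 1.0
            for m in order[i:j + 1]:
                totals[m] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def critical_difference(q_alpha, n_methods, n_datasets):
    """Bonferroni-Dunn critical difference for average ranks."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))
```

Under this scheme, a baseline would be judged significantly worse than the reference method when the gap between their average ranks exceeds the critical difference.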