GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data

Sascha Marton; Stefan Lüdtke; Christian Bartelt; Heiner Stuckenschmidt

TL;DR

This paper proposes GRANDE, a novel approach for learning hard, axis-aligned decision tree ensembles with end-to-end gradient descent. It is based on a dense representation of tree ensembles, which makes it possible to use backpropagation with a straight-through operator to jointly optimize all model parameters.
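To make the straight-through idea concrete, the following is a minimal sketch in plain Python of a single hard, axis-aligned split. The forward pass rounds a sigmoid to a 0/1 routing decision; the backward pass treats the rounding as if it were the identity and uses the sigmoid's derivative as a surrogate gradient. The function names, the choice of sigmoid, the fixed `feature_index`, and the toy numbers are all assumptions made for illustration and are not taken from the authors' implementation.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def hard_split_forward(x, feature_index, threshold):
    """Hard axis-aligned split on one feature.

    Returns 1.0 if the selected feature is at or above the threshold,
    else 0.0. This is the rounded version of a soft sigmoid decision.
    """
    soft = sigmoid(x[feature_index] - threshold)
    return 1.0 if soft >= 0.5 else 0.0


def hard_split_backward(x, feature_index, threshold):
    """Straight-through surrogate gradient with respect to the threshold.

    The rounding step has zero gradient almost everywhere, so the
    backward pass ignores it and differentiates the soft sigmoid instead.
    """
    soft = sigmoid(x[feature_index] - threshold)
    return -soft * (1.0 - soft)


# Hypothetical sample with three features; split on feature 1 at 1.0.
sample = [0.2, 1.5, -0.7]
decision = hard_split_forward(sample, feature_index=1, threshold=1.0)
grad_threshold = hard_split_backward(sample, feature_index=1, threshold=1.0)
print(decision, grad_threshold)
```

The routing decision stays hard at inference time, while the threshold still receives a nonzero gradient during training. In an automatic differentiation framework the same effect is usually obtained by adding the stopped-gradient difference between the hard and soft outputs to the soft output.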

Abstract

Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose GRANDE (GRAdieNt-Based Decision Tree Ensembles), a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which makes it possible to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both simple and complex relations within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets. The method is available at: https://github.com/s-marton/GRANDE
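The instance-wise weighting mentioned in the abstract can be illustrated with a small sketch. It assumes that every leaf of every tree carries a learnable weight and a prediction value, that a sample selects one leaf per tree, and that the selected weights are normalized with a softmax before the leaf values are combined. The softmax normalization, the function names, and all numbers below are assumptions for illustration; the abstract itself does not specify these details.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(leaf_indices, leaf_weights, leaf_values):
    """Combine trees using the weights of the leaves a sample reaches.

    leaf_indices[t] is the index of the leaf the sample reaches in tree t.
    leaf_weights[t][l] and leaf_values[t][l] are the (hypothetical)
    weight and prediction stored at leaf l of tree t.
    """
    selected_weights = [leaf_weights[t][leaf] for t, leaf in enumerate(leaf_indices)]
    selected_values = [leaf_values[t][leaf] for t, leaf in enumerate(leaf_indices)]
    normalized = softmax(selected_weights)
    prediction = sum(w * v for w, v in zip(normalized, selected_values))
    return prediction, normalized


# Two trees of depth two, so four leaves each (toy numbers).
leaf_weights = [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
leaf_values = [[-1.0, 3.0, 0.5, 0.0], [1.0, 1.0, -2.0, 0.0]]

# The sample reaches leaf 1 in the first tree and leaf 0 in the second.
prediction, weights = ensemble_predict([1, 0], leaf_weights, leaf_values)
print(prediction, weights)
```

Because the weight depends on the leaf a sample lands in, a different sample can put most of its weight on a different tree. This is one way a single model could represent both simple and complex relations.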

Paper Structure (20 sections, 8 equations, 5 figures, 29 tables)

Introduction
Background: Gradient-Based Decision Trees
GRANDE: Gradient-Based Decision Tree Ensembles
From Decision Trees to Weighted Tree Ensembles
Differentiable Split Functions
Instance-Wise Estimator Weights
Regularization: Feature Subset, Data Subset and Dropout
Experimental Evaluation
Experimental Setup
Results
Case Study: Instance-Wise Weighting for the PhishingWebsites Dataset
Related Work
Conclusion and Future Work
Benchmark Dataset Selection
Additional Results
...and 5 more sections

Figures (5)

Figure 1: Differentiable Split Functions. The sigmoid gradient declines smoothly, while entmoid's gradient decays more rapidly but becomes zero for large values. The scaled softsign has high gradients for small values but maintains a responsive gradient for large values, offering greater sensitivity.
Figure 2: GRANDE Architecture. This figure visualizes the structure and weighting of GRANDE for an exemplary ensemble with two trees of depth two. For each tree in the ensemble, and for every sample, we determine the weight of the leaf which the sample is assigned to.
Figure 3: Highest-Weighted Estimator. This figure visualizes the DT from GRANDE (1024 total estimators) which has the highest weight for an exemplary instance.
Figure 4: Anchors Explanations. This figure shows the local explanations generated by Anchors for the given instance. The explanation for GRANDE only comprises a single rule. In contrast, the corresponding explanations for the other methods have significantly higher complexity, which indicates that these methods are not able to learn simple representations within a complex model.
Figure 5: Performance Profile (HPO 250 Trials). The performance profile is based on the macro F1-Score with optimized hyperparameters (complete grids, 250 trials). The x-axis represents a tolerance factor, and the y-axis is a proportion of the evaluated datasets.
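The gradient behavior described in the Figure 1 caption can be checked numerically with a short sketch. It compares the sigmoid with a softsign that is shifted and scaled to the range (0, 1). The exact scaling used in the paper is not given in this section, so the form `0.5 * (x / (1 + |x|) + 1)` is an assumption, and entmoid is left out because its definition does not appear here.

```python
import math


def sigmoid(x):
    """Logistic function with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the logistic function."""
    s = sigmoid(x)
    return s * (1.0 - s)


def scaled_softsign(x):
    """Softsign rescaled from (-1, 1) to (0, 1); the scaling is assumed."""
    return 0.5 * (x / (1.0 + abs(x)) + 1.0)


def scaled_softsign_grad(x):
    """Derivative of the rescaled softsign above."""
    return 0.5 / (1.0 + abs(x)) ** 2


# Print both gradients at a small and a large input for comparison.
for x in (0.0, 10.0):
    print(x, sigmoid_grad(x), scaled_softsign_grad(x))
```

Under this assumed scaling, the softsign gradient decays polynomially while the sigmoid gradient decays exponentially, so the softsign keeps a usable gradient far from the threshold.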
