End-to-end Feature Selection Approach for Learning Skinny Trees

Shibal Ibrahim; Kayhan Behdin; Rahul Mazumder

TL;DR

This work tackles the problem of selecting a compact, informative feature subset during the training of tree ensembles. It introduces Skinny Trees, an end-to-end framework that jointly learns differentiable tree ensembles and a global feature subset by employing a group $\ell_0$-$\ell_2$ regularizer and a proximal mini-batch gradient descent optimization with a dense-to-sparse learning schedule. The approach comes with convergence guarantees for its nonconvex, nonsmooth objective and demonstrates substantial improvements in feature sparsity and predictive performance across synthetic and real datasets, including significant compression and faster inference. Empirically, Skinny Trees outperform several wrapper-based and embedded feature-selection methods, offering a scalable, interpretable, and efficient alternative for feature-constrained tree ensembles.

Abstract

We propose a new optimization-based approach for feature selection in tree ensembles, an important problem in statistics and machine learning. Popular tree ensemble toolkits, e.g., Gradient Boosted Trees and Random Forests, support feature selection post-training based on feature importance scores; while very popular, these scores are known to have drawbacks. We propose Skinny Trees: an end-to-end toolkit for feature selection in tree ensembles where we train a tree ensemble while controlling the number of selected features. Our optimization-based approach learns an ensemble of differentiable trees, and simultaneously performs feature selection using a grouped $\ell_0$-regularizer. We use first-order methods for optimization and present convergence guarantees for our approach. We use a dense-to-sparse regularization scheduling scheme that can lead to more expressive and sparser tree ensembles. On 15 synthetic and real-world datasets, Skinny Trees can achieve $1.5\times$ to $620\times$ feature compression rates, leading to up to $10\times$ faster inference over dense trees, without any loss in performance. Skinny Trees lead to better feature selection than many existing toolkits; e.g., in terms of AUC performance for a 25% feature budget, Skinny Trees outperform LightGBM by $10.2\%$ (up to $37.7\%$), and Random Forests by $3\%$ (up to $12.5\%$).
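To make the combination of a group-sparsity penalty and first-order optimization concrete, here is a minimal NumPy sketch of one proximal gradient step. It is an illustration under stated assumptions, not the authors' implementation: it assumes the penalty has the form $\lambda_0 \sum_k \mathbf{1}[\bm{W}_k \neq 0] + \lambda_2 \sum_k \|\bm{W}_k\|_2^2$, with one group (row) of weights per feature. The function names, the array layout, and the step size `eta` are hypothetical. Under that assumed penalty, the proximal operator shrinks each row by $1/(1 + 2\eta\lambda_2)$ and zeroes any row whose norm is at most $\sqrt{2\eta\lambda_0(1 + 2\eta\lambda_2)}$.

```python
import numpy as np


def prox_group_l0_l2(Z, eta, lam0, lam2):
    """Proximal operator of eta * (lam0 * #nonzero rows + lam2 * sum of squared row norms).

    Z has shape (num_features, group_size); row k holds all weights tied to feature k.
    This layout is an assumption made for illustration.
    """
    c = 1.0 + 2.0 * eta * lam2            # ridge shrinkage factor
    norms = np.linalg.norm(Z, axis=1)     # one norm per feature group
    keep = norms > np.sqrt(2.0 * eta * lam0 * c)  # group hard-threshold test
    return (Z / c) * keep[:, None]        # shrink kept rows, zero out the rest


def prox_gradient_step(W, grad, eta, lam0, lam2):
    """One step: gradient step on the smooth loss, then the proximal map."""
    return prox_group_l0_l2(W - eta * grad, eta, lam0, lam2)


# Toy usage: the second feature group is small, so it is removed entirely.
Z = np.array([[3.0, 4.0], [0.1, 0.1]])
W_new = prox_group_l0_l2(Z, eta=0.5, lam0=1.0, lam2=1.0)
selected = np.flatnonzero(np.linalg.norm(W_new, axis=1) > 0)
```

A feature counts as selected only if its whole row survives the threshold, which is how a penalty of this kind controls the number of features used by the entire ensemble rather than by individual trees.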

Paper Structure (33 sections, 9 theorems, 48 equations, 4 figures, 9 tables, 2 algorithms)

This paper contains 33 sections, 9 theorems, 48 equations, 4 figures, 9 tables, 2 algorithms.

INTRODUCTION
RELATED WORK
PRELIMINARIES
PROBLEM FORMULATION
END-TO-END OPTIMIZATION APPROACH
Proximal mini-batch gradient descent
Convergence Analysis of the Proximal Mini-batch Gradient Descent Algorithm
Dense-to-Sparse Learning (DSL)
SYNTHETIC EXPERIMENTS
REAL DATA EXPERIMENTS
Studying a single tree
Skinny Trees vs Dense Soft Trees
Skinny Trees vs Classical Trees
Skinny Trees vs Neural Networks
Dense-to-Sparse Learning
...and 18 more sections

Key Result

Theorem 1

Let $\lambda_2>0$ and suppose Assumptions sproperties, lproperties and o-bounded hold. Then:

Figures (4)

Figure 1: Illustration of Skinny Trees. Each horizontal slice $\bm{\mathcal{W}}_{k,:,:}$ depicts a single feature. White slices indicate features filtered out by the ensemble during training. Each vertical slice (along the depth of the page), $\bm{\mathcal{W}}_{:,:,j} = \bm{W}_j$, corresponds to the parameters of the $j$-th (splitting) supernode (blue circles) in the ensemble, eventually producing the routing decisions. The red squares depict leaf nodes. $S(\cdot)$ denotes an activation function, which can be Sigmoid [Jordan1994] or Smooth-Step [Hazimeh2020b].
Figure 2: Trajectory of validation loss and feature sparsity during training with dense-to-sparse learning.
Figure 3: Features selected by Random Forests, XGBoost and Skinny Trees for different sample sizes.
Figure 4: Performance without/with Dense-to-sparse learning for different feature selection budgets.
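
The Figure 1 caption describes routing decisions produced by applying an activation $S(\cdot)$ to the supernode parameters $\bm{W}_j$. The sketch below illustrates why a filtered-out feature (an all-zero row) cannot influence routing. It is a hedged illustration: the `(num_features, num_trees)` layout of `W_j`, the helper name `routing_probs`, and the `gamma` parameter are assumptions. The smooth-step form used here is the cubic commonly associated with the cited Smooth-Step activation, which is exactly 0 below $-\gamma/2$ and exactly 1 above $\gamma/2$.

```python
import numpy as np


def sigmoid(t):
    """Logistic activation: smooth, never exactly 0 or 1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(t, dtype=float)))


def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 for t <= -gamma/2, 1 for t >= gamma/2, cubic in between."""
    t = np.asarray(t, dtype=float)
    inner = -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, inner))


def routing_probs(x, W_j, activation):
    """Probability of routing left at supernode j, one value per tree.

    x has shape (num_features,); W_j has shape (num_features, num_trees).
    Both shapes are assumptions made for this example.
    """
    return activation(x @ W_j)


# Feature 1 has an all-zero row, so changing x[1] leaves routing unchanged.
W_j = np.array([[0.5, -0.2], [0.0, 0.0], [0.1, 0.3]])
x_a = np.array([1.0, 2.0, 3.0])
x_b = np.array([1.0, 100.0, 3.0])
same = np.allclose(routing_probs(x_a, W_j, sigmoid), routing_probs(x_b, W_j, sigmoid))
```

If the same row is zero in every supernode, the ensemble never needs that input at inference time.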

Theorems & Definitions (18)

Theorem 1
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
Lemma 5
...and 8 more
