Near Optimal Decision Trees in a SPLIT Second

Varun Babbar, Hayden McTavish, Cynthia Rudin, Margo Seltzer

TL;DR

This work addresses the challenge of building accurate yet sparse and interpretable decision trees at scale. It introduces SPLIT, a lookahead-based framework that searches shallow prefixes optimally while completing deeper parts greedily, achieving near-optimal accuracy with significantly faster runtimes than fully optimal methods. The authors extend SPLIT with LicketySPLIT (polynomial-time variant) and RESPLIT (Rashomon-set estimation) to balance scalability and search completeness, including theoretical runtime analyses and empirical validation across diverse datasets. The approach yields large speedups, maintains competitive test loss and sparsity, and enables scalable Rashomon-set analysis for reliable feature importance and model multiplicity assessments, with potential broad impact on interpretable ML deployment.

Abstract

Decision tree optimization is fundamental to interpretable machine learning. The most popular approach is to greedily search for the best feature at every decision point, which is fast but provably suboptimal. Recent approaches find the global optimum using branch and bound with dynamic programming, showing substantial improvements in accuracy and sparsity at great cost to scalability. An ideal solution would have the accuracy of an optimal method and the scalability of a greedy method. We introduce a family of algorithms called SPLIT (SParse Lookahead for Interpretable Trees) that moves us significantly forward in achieving this ideal balance. We demonstrate that not all sub-problems need to be solved to optimality to find high quality trees; greediness suffices near the leaves. Since each depth adds an exponential number of possible trees, this change makes our algorithms orders of magnitude faster than existing optimal methods, with negligible loss in performance. We extend this algorithm to allow scalable computation of sets of near-optimal trees (i.e., the Rashomon set).
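The lookahead idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary features and labels, scores a tree by misclassified fraction plus a per-leaf penalty `lam`, uses Gini impurity for the greedy step, and omits the branch and bound, caching, and any post-processing a real solver would use. All function names and the toy dataset are hypothetical.

```python
import numpy as np


def leaf_cost(y, lam, n_total):
    """Regularised cost of one leaf: misclassified fraction plus penalty lam."""
    ones = int(y.sum())
    return min(ones, len(y) - ones) / n_total + lam


def gini(y):
    """Binary Gini impurity of a label vector."""
    p = y.mean()
    return 2.0 * p * (1.0 - p)


def valid_splits(X):
    """Yield (feature index, mask) for splits that leave both sides non-empty."""
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.any() and not mask.all():
            yield j, mask


def greedy_cost(X, y, depth, lam, n_total):
    """Cost of a tree grown greedily: lowest weighted Gini at every node."""
    best = leaf_cost(y, lam, n_total)
    splits = list(valid_splits(X))
    if depth == 0 or not splits:
        return best
    _, mask = min(
        splits,
        key=lambda s: s[1].sum() * gini(y[s[1]]) + (~s[1]).sum() * gini(y[~s[1]]),
    )
    split = (greedy_cost(X[mask], y[mask], depth - 1, lam, n_total)
             + greedy_cost(X[~mask], y[~mask], depth - 1, lam, n_total))
    # Keep the split only if it beats stopping at a leaf (sparsity penalty).
    return min(best, split)


def lookahead_cost(X, y, depth, lookahead, lam, n_total):
    """Exhaustive search for `lookahead` levels, greedy completion below them.

    Assumes 0 <= lookahead < depth.
    """
    if lookahead == 0:
        return greedy_cost(X, y, depth, lam, n_total)
    best = leaf_cost(y, lam, n_total)
    for _, mask in valid_splits(X):
        cost = (lookahead_cost(X[mask], y[mask], depth - 1, lookahead - 1, lam, n_total)
                + lookahead_cost(X[~mask], y[~mask], depth - 1, lookahead - 1, lam, n_total))
        best = min(best, cost)
    return best


# Toy data (hypothetical): label is XOR of features 0 and 1; feature 2 is a
# weakly predictive distractor that attracts the greedy root split.
X = np.array([[0, 0, 0], [0, 0, 0], [1, 1, 0], [1, 1, 1],
              [0, 1, 1], [0, 1, 1], [1, 0, 1], [1, 0, 0]])
y = X[:, 0] ^ X[:, 1]

print("greedy   :", greedy_cost(X, y, depth=2, lam=0.01, n_total=len(y)))
print("lookahead:", lookahead_cost(X, y, depth=2, lookahead=1, lam=0.01, n_total=len(y)))
```

On this toy data, the purely greedy tree splits on the distractor first and cannot recover the XOR structure within depth 2, whereas one level of exhaustive search followed by greedy completion finds the zero-error tree.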

Paper Structure

This paper contains 74 sections, 15 theorems, 56 equations, 28 figures, 18 tables, 14 algorithms.

Key Result

Theorem 6.1

For a dataset $D$ with $k$ features and $n$ samples, depth constraint $d$ such that $d \ll k$, and lookahead depth $0 \leq d_l < d$, the SPLIT algorithm has runtime $\mathcal{O}(n(d-d_l)k^{d_l+1} + nk^{d-d_l})$. If we cache repeated subproblems, the runtime reduces to $\mathcal{O}(\frac{n(d-d_l)
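To get a feel for how the lookahead depth trades off the two terms of the first (uncached) bound in Theorem 6.1, the expression inside the $\mathcal{O}(\cdot)$ can be evaluated directly. This is only a rough sketch: big-O notation hides constant factors, so the numbers indicate relative growth rather than actual runtimes. The function names are hypothetical.

```python
def split_bound(n, k, d, d_l):
    """Evaluate n(d - d_l)k^(d_l + 1) + n k^(d - d_l), ignoring constants."""
    if not 0 <= d_l < d:
        raise ValueError("lookahead depth must satisfy 0 <= d_l < d")
    first_term = n * (d - d_l) * k ** (d_l + 1)
    second_term = n * k ** (d - d_l)
    return first_term + second_term


def best_lookahead(n, k, d):
    """Lookahead depth in [0, d) that minimises the bound expression."""
    return min(range(d), key=lambda d_l: split_bound(n, k, d, d_l))


# Example with k = 10 features, depth budget d = 5, and n = 1 (n only rescales).
for d_l in range(5):
    print(d_l, split_bound(n=1, k=10, d=5, d_l=d_l))
print("minimiser:", best_lookahead(n=1, k=10, d=5))
```

For these illustrative values, the first term grows with $d_l$ while the second shrinks, so the expression is smallest at an intermediate lookahead depth.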

Figures (28)

  • Figure 1: An illustration of the power of our optimization algorithm. We train $3$ decision trees on the Bike dataset, with the aim of predicting bike rentals in Washington DC in a given time period. A greedy tree is fast but suboptimal. An optimal tree performs well but is very slow. Our algorithm strikes the perfect balance, providing well-performing trees in a SPLIT second, orders of magnitude faster than optimal approaches in the literature.
  • Figure 2: A heatmap of the proportion of splits of trees in the Rashomon set that are greedy, stratified by level, for different ($\lambda, \epsilon$) combinations. Only $4$ levels are shown as the $5^{th}$ level corresponds to the leaf. The greyed-out regions in the bottom right of a plot represent ($\lambda, \epsilon$) for which the Rashomon set did not contain any trees of that depth. Generally, as we approach the leaves, the splits appearing in $\epsilon$-optimal trees become increasingly greedy. This is especially noticeable for the Netherlands, Covertype, and COMPAS datasets.
  • Figure 3: Regularized test loss vs training time (in seconds) for GOSDT vs our algorithms. The size of the points indicates the number of leaves in the resulting tree. Both SPLIT and LicketySPLIT are much faster for most values of sparsity penalty $\lambda$, with the only potential slowdown being in the sub-second regime due to overhead costs.
  • Figure 4: A comparison between the performance of our algorithms and competitors (depth budget $5$, lookahead depth $2$). The red box in the upper plot illustrates the region containing sparse and accurate models. The lower plots show the test loss vs training time for models in the red box. SPLIT and LicketySPLIT consistently lie on the bottom left of the test loss-sparsity frontier, with runtimes orders of magnitude faster than many competitors. Our algorithms also offer the ideal compromise between runtime and loss. All metrics are averaged over $3$ test-train splits.
  • Figure 5: A performance comparison between our algorithm and those in the literature. The lower row shows zoomed-in versions of the red boxes in the upper row. This is complementary to Figure 4 and shows more datasets for completeness. The depth budget for all algorithms whose depth budget can be specified is $5$.
  • ...and 23 more figures

Theorems & Definitions (27)

  • Theorem 6.1: Runtime Complexity of SPLIT
  • Corollary 6.1: Optimal Lookahead Depth for Minimal Runtime
  • Corollary 6.1: Runtime Savings of SPLIT Relative to Globally Optimal Approaches
  • Theorem 6.2: Runtime Complexity of LicketySPLIT
  • Theorem 6.3: SPLIT Can be Arbitrarily Better than Greedy
  • Theorem 1.1
  • proof
  • Theorem 1.1: Runtime Complexity of SPLIT
  • proof
  • Corollary 1.1: Optimal Lookahead Depth for Minimal Runtime
  • ...and 17 more