Branches: Efficiently Seeking Optimal Sparse Decision Trees with AO*

Ayman Chaouki, Jesse Read, Albert Bifet

TL;DR

Branches introduces an AO*-type algorithm operating on an AND/OR graph to efficiently seek optimal sparse decision trees with a joint accuracy–sparsity objective ${\mathcal{H}}_{\lambda}(T) = {\mathcal{H}}(T) - \lambda {\mathcal{S}}(T)$. It defines a Purification Bound-based heuristic and proves optimality with quantified complexity, while supporting non-binary, multi-way splits via ordinal encoding. Empirical results show Branches often outperforms state-of-the-art DFS/BFS methods in runtime and iterations, and it provides anytime solutions even under time limits. The work highlights practical gains in interpretability and scalability for large, real-world datasets, and suggests avenues for faster implementations and hybrid search strategies in future work.
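
As a rough illustration of how the objective ${\mathcal{H}}_{\lambda}(T) = {\mathcal{H}}(T) - \lambda {\mathcal{S}}(T)$ trades accuracy against sparsity, the sketch below scores two hypothetical trees. It assumes that ${\mathcal{H}}(T)$ is the tree's accuracy and ${\mathcal{S}}(T)$ its number of splits; the names `TreeSummary`, `regularised_objective`, and `best_tree`, as well as the numbers, are invented for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TreeSummary:
    """Hypothetical summary of a decision tree (illustration only)."""
    accuracy: float  # H(T), assumed here to be accuracy in [0, 1]
    n_splits: int    # S(T), assumed here to be the number of splits


def regularised_objective(tree: TreeSummary, lam: float) -> float:
    """Return H_lambda(T) = H(T) - lambda * S(T)."""
    return tree.accuracy - lam * tree.n_splits


def best_tree(trees: Iterable[TreeSummary], lam: float) -> TreeSummary:
    """Pick the candidate with the highest regularised objective."""
    return max(trees, key=lambda t: regularised_objective(t, lam))


# Two made-up candidates: a one-split stump and a more accurate four-split tree.
stump = TreeSummary(accuracy=0.80, n_splits=1)
deeper = TreeSummary(accuracy=0.85, n_splits=4)

# A small penalty favours the more accurate tree, a larger one favours the stump.
choice_small_lambda = best_tree([stump, deeper], lam=0.01)
choice_large_lambda = best_tree([stump, deeper], lam=0.05)
```

Under these assumptions, raising $\lambda$ makes each extra split more expensive, so the preferred tree shifts from the deeper candidate to the stump.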

Abstract

Decision Tree (DT) Learning is a fundamental problem in Interpretable Machine Learning, yet it poses a formidable optimisation challenge. Practical algorithms have recently emerged, primarily leveraging Dynamic Programming and Branch & Bound. However, most of these approaches rely on a Depth-First-Search strategy, which is inefficient when searching for DTs at high depths and requires the definition of a maximum depth hyperparameter. Other methods employ Best-First-Search to circumvent these issues. The downside of this strategy is its higher memory consumption; it therefore has to be designed efficiently, taking full advantage of the problem's structure. We formulate the problem within an AND/OR graph search framework and solve it with a novel AO*-type algorithm called Branches. We prove both optimality and complexity guarantees for Branches, and we show that it is more efficient than the state of the art, both theoretically and on a variety of experiments. Furthermore, unlike the other methods, Branches supports non-binary features, and we show that this property can induce further gains in computational efficiency.

Paper Structure (40 sections, 18 theorems, 134 equations, 40 figures, 5 tables, 1 algorithm)

Key Result

Proposition 3.0

Let $\pi$ be a policy and $l \in \mathcal{S} \setminus \mathcal{T}$, then there exists a minimum $\tau_l^\pi \ge 1$ such that for any $t \ge \tau_l^\pi$, $T_{l, t}^\pi = \{ \overline{l_1}, \ldots, \overline{l_{|T_{\tau_l^\pi}|}}\}$ is composed of terminal states only.
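
To make the statement concrete, the following sketch rolls out the policy shown in Figure 2 from the root $\Omega$ and reports the first step at which every state is terminal. It rests on an assumption that this excerpt does not spell out: that $T_{l, t}^\pi$ is obtained by applying $\pi$ once to every non-terminal state at each step. The names `TERMINAL`, `step`, and `first_all_terminal_step` are hypothetical and chosen only for the example.

```python
from typing import Dict, FrozenSet, Tuple, Union

Clause = Tuple[int, int]        # (feature index, value), e.g. (1, 0) for "1:0"
Branch = FrozenSet[Clause]      # conjunction of clauses; the empty set is Omega
State = Tuple[Branch, bool]     # (branch, is_terminal)
TERMINAL = "terminal"           # stands in for the terminal action \overline{a}

OMEGA: Branch = frozenset()

# Policy of Figure 2: split on feature 1, then on feature 2, then terminate.
policy: Dict[Branch, Union[int, str]] = {OMEGA: 1}
for v1 in (0, 1):
    policy[frozenset({(1, v1)})] = 2
    for v2 in (0, 1):
        policy[frozenset({(1, v1), (2, v2)})] = TERMINAL


def step(states: FrozenSet[State], values=(0, 1)) -> FrozenSet[State]:
    """Apply the policy once to every non-terminal state (assumed semantics)."""
    nxt = set()
    for branch, is_terminal in states:
        if is_terminal:
            nxt.add((branch, True))            # terminal states stay as they are
            continue
        action = policy[branch]
        if action == TERMINAL:
            nxt.add((branch, True))            # transition l -> \overline{l}
        else:
            for v in values:                   # split on feature `action`
                nxt.add((branch | {(action, v)}, False))
    return frozenset(nxt)


def first_all_terminal_step(root: Branch, max_steps: int = 100):
    """Return (tau, states): the first step at which all states are terminal."""
    states: FrozenSet[State] = frozenset({(root, False)})
    for t in range(1, max_steps + 1):
        states = step(states)
        if all(is_terminal for _, is_terminal in states):
            return t, states
    raise RuntimeError("policy did not terminate within max_steps")


tau, final_states = first_all_terminal_step(OMEGA)
```

Under this reading, the rollout splits on feature 1, then on feature 2, and then terminates all four resulting branches; further steps leave the set of terminal states unchanged.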

Figures (40)

  • Figure 1: Consider a feature space with five binary features $X^{\left( 1\right)}, X^{\left( 2\right)}, X^{\left( 3\right)}, X^{\left( 4\right)}, X^{\left( 5\right)} \in \{ 0, 1\}$. The figure provides an example of a sub-DT $T = \{ l_1, l_3, l_4\}$ rooted in $l$, obtained by splitting branch $l$ with respect to feature $X^{\left( 5\right)}$ and then splitting branch $l_2$ with respect to feature $X^{\left( 2\right)}$. The red perimeter emphasises that $T$ is rooted in $l$. Here $\mathcal{S}\left( T\right) = 2$, $l = \mathbb{I}\{ X^{\left( 3\right)} = 1\} \wedge \mathbb{I}\{ X^{\left( 1\right)} = 0\}$, $l_1 = l \wedge \mathbb{I}\{ X^{\left( 5\right)} = 0\}$, $l_2 = l \wedge \mathbb{I}\{ X^{\left( 5\right)} = 1\}$, $l_3 = l_2 \wedge \mathbb{I}\{ X^{\left( 2\right)} = 0\}$, $l_4 = l_2 \wedge \mathbb{I}\{ X^{\left( 2\right)} = 1\}$.
  • Figure 2: AND/OR graph for a classification problem with three binary features $X^{\left( 1\right)}, X^{\left( 2\right)}, X^{\left( 3\right)}$. To make the notation lighter, we represent any branch $l = \bigwedge_{v=1}^{\mathcal{S}\left( l\right)}\mathbb{I}\{ X^{\left( i_v\right)} = j_v\}$ with $i_1:j_1, \ldots, i_{\mathcal{S}\left( l\right)}:j_{\mathcal{S}\left( l\right)}$, for example $1:0, 2:1$ represents the branch $\mathbb{I}\{ X^{\left( 1\right)} = 0\} \wedge \mathbb{I}\{ X^{\left( 2\right)} = 1\}$. We colour in red the actions taken by the policy: $\pi\left( \Omega\right) = 1, \pi\left( 1:0\right) = \pi\left( 1:1\right) = 2, \pi\left( 1:0, 2:0\right) = \pi\left( 1:0, 2:1\right) = \pi\left( 1:1, 2:0\right) = \pi\left( 1:1, 2:1\right) = \overline{a}$, which also depicts the DT $T^\pi$ of $\pi$. Note that, although the curved connector associated with the terminal action $\overline{a}$ connects to the same node, it transitions to a terminal state from which no action can be taken. We represent it like this to avoid overloading the figure with additional nodes corresponding to the terminal states.
  • Figure 3: Optimal DT depicting the class variable that satisfies $Y=1$ if and only if $X^{\left( 1\right)} = 1$ or $X^{\left( 1\right)} = 3$ on the space $\mathcal{X} = \{1, 2, 3, 4\}$.
  • Figure 4: The new optimal sparse DT on the new feature space $\mathcal{X}'$.
  • Figure 5: The number of unnecessary branches introduced by Binary Encoding.
  • ...and 35 more figures
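
The branch notation in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative encoding, not the paper's implementation: a branch is represented as a set of `(feature, value)` clauses, and the helper names `split`, `satisfies`, and `leaves_partition` are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterable, List, Tuple

Clause = Tuple[int, int]      # (feature index, value)
Branch = FrozenSet[Clause]    # conjunction of clauses


def split(branch: Branch, feature: int, values=(0, 1)) -> List[Branch]:
    """Split a branch on a feature it does not already constrain."""
    if any(f == feature for f, _ in branch):
        raise ValueError(f"feature {feature} is already fixed in this branch")
    return [branch | {(feature, v)} for v in values]


def satisfies(x: Dict[int, int], branch: Branch) -> bool:
    """True if the point x meets every clause of the branch."""
    return all(x[f] == v for f, v in branch)


def leaves_partition(root: Branch, leaves: Iterable[Branch], n_features: int) -> bool:
    """Check that every point covered by `root` falls in exactly one leaf."""
    leaves = list(leaves)
    for bits in product((0, 1), repeat=n_features):
        x = dict(zip(range(1, n_features + 1), bits))
        if satisfies(x, root) and sum(satisfies(x, leaf) for leaf in leaves) != 1:
            return False
    return True


# Figure 1: l = I{X3 = 1} AND I{X1 = 0}
l: Branch = frozenset({(3, 1), (1, 0)})
l1, l2 = split(l, 5)     # l1 = l AND I{X5 = 0}, l2 = l AND I{X5 = 1}
l3, l4 = split(l2, 2)    # l3 = l2 AND I{X2 = 0}, l4 = l2 AND I{X2 = 1}

T = {l1, l3, l4}         # the sub-DT rooted in l
n_splits = len(T) - 1    # with binary splits, each split adds exactly one leaf
```

With binary features, two splits produce three leaves, which agrees with $\mathcal{S}\left( T\right) = 2$ in the caption, and `leaves_partition(l, T, 5)` confirms that the leaves cover the region of $l$ without overlap.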

Theorems & Definitions (30)

  • Proposition 3.0
  • Proposition 3.0
  • Proposition 4.0: Bellman Optimality Equations
  • Proposition 4.0: Purification Bound
  • Theorem 4.1: Optimality
  • Theorem 4.2: Complexity
  • Theorem 5.1
  • proof
  • Proposition 7.1
  • proof
  • ...and 20 more