Foundational theory for optimal decision tree problems. II. Optimal hypersurface decision tree algorithm

Xi He

TL;DR

This work advances optimal decision tree methods by introducing hypersurface splits and the first hypersurface ODT (HODT) algorithm. It builds on Part I's axiomatic ODT framework to develop theoretical guarantees around crossed hyperplanes and ancestry relations, then delivers an exact HODT procedure that embeds data via a Veronese mapping and fuses generation, filtering, and evaluation into a single recursive pass. Because the exact problem is intractable, the work also proposes two practical heuristics, hodtCoreset and hodtWSH, which enable scalable performance on synthetic and real-world datasets. Across extensive experiments, HODT demonstrates superior accuracy and robustness to noise compared with axis-parallel baselines when model complexity is controlled, highlighting the value of richer hypersurface splits for interpretable, high-performance decision trees. The work further points to promising future directions, including handling categorical data, mixed splitting rules, and ensembles such as random forests built from hypersurface trees.
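The embedding step mentioned above can be illustrated with a minimal sketch: a polynomial hypersurface of degree $M$ in $\mathbb{R}^{D}$ becomes an ordinary hyperplane once each point is mapped to its monomials. The function names (`veronese_embedding`, `embedded_dimension`, `side`) are hypothetical, and leaving out the constant monomial so that it acts as the intercept is an assumption of this sketch, not necessarily the convention used in the paper.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def veronese_embedding(x, degree):
    """Map a point x in R^D to all monomials of total degree 1..degree.

    Assumption of this sketch: the constant monomial is omitted and
    handled separately as the intercept of the embedded hyperplane.
    """
    features = []
    for m in range(1, degree + 1):
        # Each multiset of m coordinate indices gives one monomial of degree m.
        for idx in combinations_with_replacement(range(len(x)), m):
            features.append(prod(x[i] for i in idx))
    return features


def embedded_dimension(dim, degree):
    """Count the non-constant monomials of degree <= `degree` in `dim` variables."""
    return comb(dim + degree, dim) - 1


def side(weights, bias, x, degree):
    """Sign of a polynomial split, evaluated as a hyperplane in the embedded space."""
    z = veronese_embedding(x, degree)
    value = sum(w * zi for w, zi in zip(weights, z)) + bias
    return (value > 0) - (value < 0)


# Unit circle x1^2 + x2^2 - 1 = 0, written as a hyperplane over
# the embedded coordinates (x1, x2, x1^2, x1*x2, x2^2).
circle_w, circle_b = [0.0, 0.0, 1.0, 0.0, 1.0], -1.0
inside = side(circle_w, circle_b, [0.5, 0.5], 2)
outside = side(circle_w, circle_b, [2.0, 0.0], 2)
```

In this example the degree-2 split in $\mathbb{R}^{2}$ is a linear test in five embedded coordinates, so any routine that handles hyperplane splits can in principle be reused on the embedded data. The embedded dimension grows as $\binom{D+M}{D}-1$, which is one reason the exact procedure becomes expensive quickly.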

Abstract

Decision trees are a ubiquitous model for classification and regression tasks due to their interpretability and efficiency. However, solving the optimal decision tree (ODT) problem remains a challenging combinatorial optimization task. Even for the simplest splitting rules--axis-parallel hyperplanes--it is NP-hard to optimize. In Part I of this series, we rigorously defined the proper decision tree model through four axioms and, based on these, introduced four formal definitions of the ODT problem. From these definitions, we derived four generic algorithms capable of solving ODT problems for arbitrary decision trees satisfying the axioms. We also analyzed the combinatorial geometric properties of hypersurfaces, showing that decision trees defined by polynomial hypersurface splitting rules satisfy the proper axioms that we proposed. In this second paper (Part II) of this two-part series, building on the algorithmic and geometric foundations established in Part I, we introduce the first hypersurface decision tree (HODT) algorithm. To the best of our knowledge, existing optimal decision tree methods are limited to hyperplane splitting rules--a special case of hypersurfaces--and rely on general-purpose solvers. In contrast, our HODT algorithm addresses the general hypersurface decision tree model without requiring external solvers. Using synthetic datasets generated from ground-truth hyperplane decision trees, we vary tree size, data size, dimensionality, and label and feature noise. The results show that our algorithm recovers the ground truth more accurately than axis-parallel trees and exhibits greater robustness to noise. We also analyze generalization performance across 30 real-world datasets, showing that HODT can achieve up to 30% higher accuracy than the state-of-the-art optimal axis-parallel decision tree algorithm when tree complexity is properly controlled.

Paper Structure

This paper contains 30 sections, 3 theorems, 17 equations, 9 figures, 7 tables, 6 algorithms.

Key Result

Theorem 2

If two hyperplanes $h_{i}$ and $h_{j}$ cross each other, then no ancestry relation exists between $h_{i}$ and $h_{j}$, and no hyperplane $h_{k}$ can separate $h_{i}$ and $h_{j}$ into different branches. Consequently, any combination of hypersurfaces containing such crossed hypersurfaces cannot form a p

Figures (9)

  • Figure 1: Synthetic dataset (left) generated by degree-2 polynomials; axis-parallel decision trees learned by CART and the optimal algorithm (middle two); and hypersurface decision trees learned using our proposed algorithm (right), with corresponding misclassification counts, tree sizes, and tree depths.
  • Figure 2: Three possible ancestry relations between two hyperplanes in $\mathbb{R}^{2}$. The black circles represent the data points used to define these hyperplanes.
  • Figure 3: Three equivalent representations describing the ancestral relations between hyperplanes. A $4$-combination of lines (left), each defined by two data points (black points) in $\mathbb{R}^{2}$, where black arrows represent the normal vectors to the corresponding hyperplanes. The ancestry relation graph (middle) captures all ancestry relations between hyperplanes. In this graph, nodes represent hyperplanes, and arrows represent ancestral relations. An incoming arrow to a node $h_{i}$ indicates that the defining data of the corresponding hyperplane lies on the negative side of $h_{i}$. The absence of an arrow indicates no ancestral relation. Outgoing arrows represent hyperplanes whose defining data lies on the positive side of $h_{i}$. The ancestral relation matrix $\boldsymbol{K}$ (right), where the elements $\boldsymbol{K}_{ij}=1$, $\boldsymbol{K}_{ij}=-1$, and $\boldsymbol{K}_{ij}=0$ indicate that $h_{j}$ lies on the positive side of $h_{i}$, on the negative side of $h_{i}$, or that there is no ancestry relation between them, respectively.
  • Figure 4: An example illustrating the proof of Fact \ref{['When-two-hyperplanes']}, demonstrating that the data items $a$, $b$, which define $h_{i}$, and $c$, $d$, which define $h_{j}$, cannot be classified into the disjoint regions defined by a third hyperplane (red).
  • Figure 5: Running time comparison between $\mathit{sodt}_{\text{rec}}$ and $\mathit{sodt}_{\text{vec}}$ with varying $K$ in the sequential setting.
  • ...and 4 more figures
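
The ancestral relation matrix $\boldsymbol{K}$ described in the Figure 3 caption can be sketched for lines in $\mathbb{R}^{2}$ that are each defined by two data points. This is only an illustration under stated assumptions: the helper names (`line_through`, `signed_side`, `ancestry_matrix`) are hypothetical, the orientation chosen for each normal vector is arbitrary, and treating defining points that lie exactly on $h_{i}$ as carrying no side information is a simplification of this sketch rather than the paper's rule.

```python
def line_through(p, q):
    """Return (w, b) such that w.x + b = 0 on the line through p and q in R^2."""
    # Rotate the direction q - p by 90 degrees to get a normal vector.
    w = (q[1] - p[1], p[0] - q[0])
    b = -(w[0] * p[0] + w[1] * p[1])
    return w, b


def signed_side(line, point, tol=1e-9):
    """Return +1, -1, or 0 for a point on the positive side, negative side, or on the line."""
    w, b = line
    value = w[0] * point[0] + w[1] * point[1] + b
    if abs(value) <= tol:
        return 0
    return 1 if value > 0 else -1


def ancestry_matrix(defining_points):
    """Build K with K[i][j] = +1 / -1 if the defining data of h_j lies strictly on
    the positive / negative side of h_i, and 0 otherwise."""
    lines = [line_through(p, q) for p, q in defining_points]
    n = len(lines)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sides = {signed_side(lines[i], pt) for pt in defining_points[j]}
            sides.discard(0)  # assumption: points on h_i carry no side information
            if sides == {1}:
                K[i][j] = 1
            elif sides == {-1}:
                K[i][j] = -1
            # Otherwise the defining points straddle h_i, so K[i][j] stays 0.
    return K


# Three lines, each given by its two defining data points.
points = [
    ((0.0, 0.0), (1.0, 0.0)),   # h_0: the x-axis
    ((0.0, 1.0), (1.0, 2.0)),   # h_1: a line above h_0's defining points
    ((0.5, -1.0), (0.5, 1.0)),  # h_2: a vertical line cutting through h_0's segment
]
K = ancestry_matrix(points)
```

Here `K[0][2]` and `K[2][0]` are both zero: the defining points of each line fall on both sides of the other, so under this sketch's reading neither line can act as an ancestor of the other.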

Theorems & Definitions (4)

  • Definition 1
  • Theorem 2
  • Lemma 3
  • Theorem 4