BooleanOCT: Optimal Classification Trees based on multivariate Boolean Rules

Jiancheng Tu, Wenqi Fan, Zhibin Wu

TL;DR

BooleanOCT tackles the interpretability-accuracy gap in tree models by introducing a mixed-integer program that learns optimal classification trees with multivariate Boolean splits. By enabling hypercube-like partitions, BooleanOCT reduces tree depth and the number of binary variables while maintaining clear interpretability, and it can optimize linear metrics (accuracy, balanced accuracy, misclassification cost) as well as nonlinear metrics like the F1-score. Across 36 binarized UCI datasets, BooleanOCT substantially outperforms traditional tree models and yields competitive or superior accuracy relative to random forests on small- to medium-scale data, with notable gains on imbalanced tasks. The approach is released as open-source, and the paper discusses scalability considerations and directions for extending to larger datasets via decomposition and multi-way splits, highlighting practical impact for high-stakes domains such as credit scoring and healthcare.
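To make the depth-reduction claim concrete, here is a minimal sketch, not the paper's formulation. It assumes a multivariate Boolean rule is a conjunction of conditions on binarized features (this summary does not spell out the exact rule form), and the names `boolean_split` and `univariate_chain` are hypothetical. A single branch node that tests such a conjunction carves out one sub-cube of the binary feature space, whereas a univariate tree has to chain one single-feature test per condition.

```python
from itertools import product

def boolean_split(x, rule):
    """One branch node: True if x satisfies every (feature index -> value) condition.

    `rule` is a dict such as {0: 1, 1: 1, 2: 0}; treating a rule as a
    conjunction is an illustrative assumption, not taken from the paper.
    """
    return all(x[j] == v for j, v in rule.items())

def univariate_chain(x, rule):
    """The same decision made by nested single-feature tests.

    Returns (decision, number_of_tests_used) so the depth cost is visible.
    """
    tests = 0
    for j, v in rule.items():
        tests += 1
        if x[j] != v:
            return False, tests
    return True, tests

rule = {0: 1, 1: 1, 2: 0}          # hypothetical rule: x0 AND x1 AND NOT x2
points = list(product([0, 1], repeat=3))

# Both procedures agree on every point of the 3-dimensional binary cube.
agree = all(boolean_split(x, rule) == univariate_chain(x, rule)[0] for x in points)

# The univariate chain needs up to len(rule) sequential tests; the Boolean node needs one.
max_tests = max(univariate_chain(x, rule)[1] for x in points)
```

Under this assumption, the single Boolean node replaces a path of three univariate nodes, which is the sense in which such splits can shorten a tree.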

Abstract

The global optimization of classification trees has demonstrated considerable promise, notably in enhancing accuracy, optimizing size, and thereby improving human comprehensibility. While existing optimal classification trees substantially improve accuracy over greedy tree models like CART, they still fall short of more complex black-box models such as random forests. To bridge this gap, we introduce a new mixed-integer programming (MIP) formulation, grounded in multivariate Boolean rules, to derive the optimal classification tree. Our methodology supports both linear metrics, including accuracy, balanced accuracy, and cost-sensitive misclassification cost, and nonlinear metrics such as the F1-score. The approach is implemented in an open-source Python package named BooleanOCT. We comprehensively benchmark these methods on 36 datasets from the UCI machine learning repository. The proposed models demonstrate practical solvability on real-world datasets, effectively handling sample sizes in the tens of thousands. When maximizing accuracy, the model achieves an average absolute improvement of 3.1% and 1.5% over random forests on small-scale and medium-sized datasets, respectively. Experiments targeting other objectives, including balanced accuracy, cost-sensitive misclassification cost, and F1-score, demonstrate the framework's wide applicability and its superiority over contemporary state-of-the-art optimal classification tree methods on small- to medium-scale datasets.
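The distinction between linear and nonlinear metrics can be illustrated with a short sketch. This is not code from the BooleanOCT package; the function names and the cost weights (`cost_fp`, `cost_fn`) are hypothetical. For a fixed dataset, accuracy, balanced accuracy, and misclassification cost are weighted sums of the confusion-matrix counts, while the F1-score is a ratio whose denominator also depends on the predictions, which is why it cannot be written directly as a linear objective.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    # Linear in the counts: the denominator is the fixed sample size.
    return (tp + tn) / (tp + fp + fn + tn)

def balanced_accuracy(tp, fp, fn, tn):
    # Linear in the counts: tp + fn and tn + fp are the fixed class sizes.
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def misclassification_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=5.0):
    # Linear in the counts; the two weights are illustrative assumptions.
    return cost_fp * fp + cost_fn * fn

def f1_score(tp, fp, fn, tn):
    # Nonlinear: tp appears in both numerator and denominator, fp in the denominator.
    return 2 * tp / (2 * tp + fp + fn)

# Small imbalanced example: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
```

On this toy example the accuracy is 0.7 while the F1-score is only 0.4, which shows why the choice of objective matters on imbalanced data.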

Paper Structure (19 sections, 28 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: The BooleanOCT tree structure of the application conditions for a professor's position.
  • Figure 2: The tree structures of Example 1.
  • Figure 3: The BooleanOCT tree structure in Example 2.
  • Figure 4: The subtrees of the branch nodes with multivariate Boolean rules.
  • Figure 5: A univariate classification tree for the BooleanOCT in Example 2.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Example 1
  • Remark 1
  • Example 2
  • Remark 2