BooleanOCT: Optimal Classification Trees based on multivariate Boolean Rules

Jiancheng Tu; Wenqi Fan; Zhibin Wu

TL;DR

BooleanOCT tackles the interpretability-accuracy gap in tree models by introducing a mixed-integer program that learns optimal classification trees with multivariate Boolean splits. By enabling hypercube-like partitions, BooleanOCT reduces tree depth and the number of binary variables while maintaining clear interpretability, and it can optimize linear metrics (accuracy, balanced accuracy, misclassification cost) as well as nonlinear metrics like the F1-score. Across 36 binarized UCI datasets, BooleanOCT substantially outperforms traditional tree models and yields competitive or superior accuracy relative to random forests on small- to medium-scale data, with notable gains on imbalanced tasks. The approach is released as open-source, and the paper discusses scalability considerations and directions for extending to larger datasets via decomposition and multi-way splits, highlighting practical impact for high-stakes domains such as credit scoring and healthcare.
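To make the depth-reduction argument concrete, here is a minimal sketch. It assumes a multivariate Boolean split takes the form of a conjunction (AND) of binarized features. The rule form used in the paper and the API of the BooleanOCT package may differ, and every name below is hypothetical. A single conjunction over two features isolates a hypercube-like region that univariate splits would need two tree levels to reach.

```python
from typing import List, Sequence, Tuple


def conjunction_split(x: Sequence[int], features: Sequence[int]) -> bool:
    """Hypothetical multivariate Boolean split: True when every selected
    binary feature of sample x equals 1 (an AND over the chosen features)."""
    return all(x[j] == 1 for j in features)


def partition(
    samples: List[Tuple[int, ...]], features: Sequence[int]
) -> Tuple[List[Tuple[int, ...]], List[Tuple[int, ...]]]:
    """Route samples to the left child if the rule holds, else to the right."""
    left = [s for s in samples if conjunction_split(s, features)]
    right = [s for s in samples if not conjunction_split(s, features)]
    return left, right


# Toy binarized samples with three features (illustrative data, not from the paper).
samples = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

# Assumed rule "x0 AND x1": one node does the work of two nested univariate splits.
left, right = partition(samples, features=(0, 1))
print("left:", left)
print("right:", right)
```

A univariate tree would first split on feature 0 and then on feature 1 to isolate the same region, so the conjunction saves one level of depth in this toy case.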

Abstract

Globally optimizing classification trees has shown considerable promise: it improves accuracy and reduces tree size, which in turn makes the trees easier for humans to understand. While existing optimal classification trees are substantially more accurate than greedy tree models such as CART, they still fall short of more complex black-box models such as random forests. To bridge this gap, we introduce a new mixed-integer programming (MIP) formulation, grounded in multivariate Boolean rules, to derive the optimal classification tree. Our methodology supports both linear metrics (accuracy, balanced accuracy, and misclassification cost) and nonlinear metrics such as the F1-score. The approach is implemented in an open-source Python package named BooleanOCT. We benchmark the proposed models on 36 datasets from the UCI Machine Learning Repository. The models are practically solvable on real-world datasets, handling dataset sizes in the tens of thousands. When the objective is to maximize accuracy, the model achieves average absolute improvements of 3.1% and 1.5% over random forests on small-scale and medium-sized datasets, respectively. Experiments targeting other objectives, including balanced accuracy, misclassification cost, and F1-score, demonstrate the framework's wide applicability and its superiority over state-of-the-art optimal classification tree methods on small- to medium-scale datasets.
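The distinction between linear and nonlinear metrics can be illustrated with a short sketch. It computes the four objectives named above from binary confusion-matrix counts using their standard definitions. The function name, the cost parameters, and the example counts are assumptions made for illustration and are not taken from the BooleanOCT package.

```python
from typing import Dict


def tree_metrics(
    tp: int, fp: int, tn: int, fn: int,
    cost_fp: float = 1.0, cost_fn: float = 1.0,
) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        # Linear in the counts: the denominator n is fixed by the data.
        "accuracy": (tp + tn) / n,
        # Also linear: tp + fn and tn + fp are the class sizes, fixed by the data.
        "balanced_accuracy": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
        # Linear: a weighted sum of the two error counts.
        "cost": cost_fp * fp + cost_fn * fn,
        # Nonlinear: the denominator includes fp, which depends on the predictions.
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical imbalanced example: 10 positives and 90 negatives.
m = tree_metrics(tp=5, fp=5, tn=85, fn=5, cost_fp=1.0, cost_fn=5.0)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this example accuracy is 0.9 while the F1-score is only 0.5, which shows why optimizing accuracy alone can be misleading on imbalanced data. Because the F1-score is a ratio whose denominator depends on the model's predictions, it cannot be written directly as a linear objective in a MIP.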
