Acceleration Techniques for Learning Optimal Classification Trees with Integer Programming

Mitchell Keegan, Michael Forbes, Paul Corry, Mahdi Abolghasemi

TL;DR

This work tackles learning globally optimal classification trees (OCTs) by accelerating the BendOCT mixed-integer programming formulation. It derives BendOCT via logic-based Benders decomposition (LBBD) and introduces a suite of enhancements: strengthened Benders cuts, a solution-polishing primal heuristic, equivalent-point inequalities (EQP), and path-bound cutting planes that exploit optimal solutions of depth-2 subtrees. Empirical results across 33 datasets show large scalability gains: within a 1-hour limit, 1173 of 1620 instances are solved to optimality, compared with 582 of 1620 for the baseline. This highlights the value of DP-inspired bounds within a MIP framework. The approach broadens the practical applicability of optimal decision trees by combining flexibility with significantly improved convergence, with potential extensions to other objectives and deeper trees.

Abstract

Decision trees are a popular machine learning model that is traditionally trained by heuristic methods. Massive improvements in computing power and optimisation techniques have led to renewed interest in learning globally optimal decision trees. Empirical evidence shows that optimal classification trees (OCTs) have better out-of-sample performance than trees trained by heuristic methods. The dominant optimisation paradigms for training OCTs are mixed-integer programming (MIP) and dynamic programming (DP). MIP formulations offer flexibility in the objectives and constraints that are modelled, but scale poorly with the size of the training dataset and the maximum tree depth. DP models represent the state of the art in scaling for OCTs, but lack some of the flexibility of MIP models. In this paper we present progress on using advanced integer programming methods to integrate ideas from DP models into MIP formulations to begin bridging the scaling gap. Using the existing BendOCT model from the literature as a base model, we introduce valid inequalities, cutting planes, and a primal heuristic to improve the scaling of MIP formulations. We show that these techniques significantly improve the ability of BendOCT to find provably optimal solutions over a wide range of datasets.

Paper Structure

This paper contains 33 sections, 29 equations, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: Basic decision tree notation. The available nodes are partitioned into internal nodes $\mathcal{B}$ and terminal nodes $\mathcal{T}$. Nodes 1 and 2 have been designated as branch nodes while nodes 3, 4, and 5 are designated as leaf nodes. Nodes 6 and 7 are cut off.
  • Figure 2: Logic-based Benders cut example tree structure. The sample is routed into node $5$ where it is misclassified.
  • Figure 3: A subset of a tree with an integral path in the relaxation solution. It is partitioned into the integral path, the depth two subtree to be optimised, and the nodes downstream of the optimised subtree. Decision variables associated with nodes not in the integral path may be fractional.
  • Figure 4: Example quantile bucket encoding on continuous feature $f^N$ with two classes. $f^N$ will have five associated binary features where the $m^{\text{th}}$ binary feature is equal to one whenever the feature value falls into the $m^{\text{th}}$ bucket and zero otherwise.
  • Figure 5: Example quantile threshold encoding on continuous feature $f^N$ with two classes. $f^N$ will have four associated binary features where the $m^{\text{th}}$ binary feature is equal to one whenever the feature value is greater than or equal to the $m^{\text{th}}$ threshold and zero otherwise.
  • ...and 7 more figures
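
The two binarisation schemes described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sample values, the use of `statistics.quantiles` to pick cut points, and the convention that a value equal to a threshold belongs to the upper bucket are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_thresholds(values, n_buckets):
    """Return the n_buckets - 1 cut points splitting values into quantile buckets.

    Hypothetical helper: the paper's exact quantile rule is not given here.
    """
    return quantiles(values, n=n_buckets, method="inclusive")


def bucket_encode(value, thresholds):
    """Bucket encoding: bit m is 1 iff value falls into bucket m (one-hot)."""
    # Number of thresholds <= value; assumed convention: ties go to the upper bucket.
    idx = bisect_right(thresholds, value)
    return [1 if m == idx else 0 for m in range(len(thresholds) + 1)]


def threshold_encode(value, thresholds):
    """Threshold encoding: bit m is 1 iff value >= threshold m (cumulative)."""
    return [1 if value >= t else 0 for t in thresholds]


# Illustrative data only, not taken from the paper.
values = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.2, 2.8, 3.1, 4.0]
cuts = quantile_thresholds(values, n_buckets=5)  # four thresholds, five buckets

for v in (0.1, 1.3, 4.0):
    print(v, bucket_encode(v, cuts), threshold_encode(v, cuts))
```

With four thresholds, the bucket encoding yields five mutually exclusive binary features, while the threshold encoding yields four nested ones. Under the tie convention assumed above, the position of the single one in the bucket encoding equals the number of ones in the threshold encoding.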