Mixed-Integer Linear Optimization for Semi-Supervised Optimal Classification Trees

Jan Pablo Burgard, Maria Eduarda Pinheiro, Martin Schmidt

TL;DR

This work addresses the challenge of learning optimal decision trees when only a subset of labels is available and external information about class sizes is accessible. It proposes S$^2$OCT, a big-$M$ MILP that jointly leverages labeled and unlabeled data and enforces a cardinality constraint on predicted unlabeled points, enabling more reliable classification under bias. Empirical results on a large set of biased and simple-random samples show that S$^2$OCT often improves accuracy and MCC compared to OCT-H, particularly in biased settings, albeit with higher computational cost. The approach provides a principled mechanism to incorporate population-level class information into tree-based models, with potential extensions to multiclass problems and scalable solution methods.

Abstract

Decision trees are one of the most popular methods for solving classification problems, mainly because of their good interpretability properties. Moreover, due to advances in recent years in mixed-integer optimization, several models have been proposed to formulate the problem of computing optimal classification trees. The goal is, given a set of labeled points, to split the feature space with hyperplanes and assign a class to each part of the resulting partition. In certain scenarios, however, labels are only available for a subset of the given points. Additionally, this subset may be non-representative, such as in the case of self-selection in a survey. Semi-supervised decision trees tackle the setting of labeled and unlabeled data and often contribute to enhancing the reliability of the results. Furthermore, undisclosed sources may provide extra information about the size of the classes. We propose a mixed-integer linear optimization model for computing semi-supervised optimal classification trees that cover the setting of labeled and unlabeled data points as well as the overall number of points in each class for a binary classification. Our numerical results show that our approach leads to better accuracy and a better Matthews correlation coefficient for biased samples compared to other optimal classification trees, even if only a few labeled points are available.
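The two quality measures named in the abstract, accuracy and the Matthews correlation coefficient (MCC), can be computed from the four entries of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not code from the paper. The function names, the label encoding (`1` for the positive class, `-1` for the negative class), and the convention of returning 0 when the MCC denominator vanishes are assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem (label encoding is an assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified points."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, a value in [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here), e.g. when every point is predicted as one class.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical imbalanced example: 9 positives, 1 negative,
# and a classifier that predicts "positive" for every point.
y_true = [1] * 9 + [-1]
y_pred = [1] * 10
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts))  # 0.9
print(mcc(*counts))       # 0.0
```

In this hypothetical example the trivial classifier reaches an accuracy of 0.9 but an MCC of 0, which illustrates why reporting the MCC alongside accuracy is informative when class sizes are unbalanced.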

Paper Structure (13 sections, 3 theorems, 42 equations, 7 figures, 3 tables)

This paper contains 13 sections, 3 theorems, 42 equations, 7 figures, 3 tables.

Key Result

Lemma 1

Consider a set of continuous functions $f_k : \mathbb{R}^p \to \mathbb{R}$, $k \in [1,d]$, for some $d\in \mathbb{N}$, and let $\Omega \subseteq \mathbb{R}^p$ be given. Suppose further that there exist values $u_k>0$ such that holds for all $x \in \Omega$ and $k \in [1,d]$. Then, $x^*\in \mathbb{R}^p$ is a solution to the problem if and only if there exist $\alpha^*, \beta^* \in \mathbb{R}^d$ s

Figures (7)

  • Figure 1: A classification tree with depth $D=2$
  • Figure 2: A 2-dimensional example and the hyperplanes produced by a tree-based partitioning with $D=2$.
  • Figure 3: ECDFs for run times (in seconds).
  • Figure 4: First row: Comparison of accuracy $\overline{\text{AC}}$ as described in \ref{comparOCTH} for the entire data set. Second row: Comparison of $\overline{\text{AC}}$ for unlabeled data. Third row: Comparison of $\overline{\text{MCC}}$ as described in \ref{comparOCTH} for the entire data set. Last row: Comparison of $\overline{\text{MCC}}$ for unlabeled data. Left: Comparison for all instances. Right: Comparison only for those instances for which both approaches terminate within the time limit.
  • Figure 5: First row: Comparison of precision $\overline{\text{PR}}$ as described in \ref{comparOCTH2} for the entire data set. Second row: Comparison of $\overline{\text{PR}}$ for unlabeled data. Third row: Comparison of recall $\overline{\text{RE}}$ as described in \ref{comparOCTH2} for the entire data set. Last row: Comparison of $\overline{\text{RE}}$ for unlabeled data. Left: Comparison for all instances. Right: Comparison only for those instances for which both approaches terminate within the time limit.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Definition 1: Branch Error
  • Definition 2: Leaf Error
  • Lemma 1
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof