Mixed-Integer Linear Optimization for Cardinality-Constrained Random Forests

Jan Pablo Burgard, Maria Eduarda Pinheiro, Martin Schmidt

TL;DR

This work tackles semi-supervised binary classification when population class counts are known but labels are scarce. It introduces a big-$M$ MILP formulation (C$^2$RF) that enforces a cardinality constraint on unlabeled predictions and proposes preprocessing and branching techniques (p-C$^2$RF) to mitigate computational blow-up. Empirical results on biased sampling scenarios show that p-C$^2$RF improves accuracy and MCC over standard random forests, with substantial runtime reductions compared to the baseline MILP. The approach, applicable to other ensemble methods and accompanied by open-source code, offers a practical path to robust semi-supervised learning under known class sizes.

Abstract

Random forests are among the most famous algorithms for solving classification problems, in particular for large-scale data sets. Considering a set of labeled points and several decision trees, the method takes the majority vote to classify a new given point. In some scenarios, however, labels are only accessible for a proper subset of the given points. Moreover, this subset can be non-representative, e.g., due to collection bias. Semi-supervised learning considers the setting of labeled and unlabeled data and often improves the reliability of the results. In addition, it can be possible to obtain additional information about class sizes from undisclosed sources. We propose a mixed-integer linear optimization model for computing a semi-supervised random forest that covers the setting of labeled and unlabeled data points as well as the overall number of points in each class for a binary classification. Since the solution time grows rapidly as the number of variables increases, we present some problem-tailored preprocessing techniques and an intuitive branching rule. Our numerical results show that our approach leads to better accuracy and a better Matthews correlation coefficient for biased samples compared to random forests by majority vote, even if only a few labeled points are available.
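
To make the role of the class-size information concrete, the following toy sketch contrasts plain majority voting with a prediction that is forced to contain a given number of positive labels. It is not the mixed-integer model proposed in the paper: it uses a simple greedy rule (label the points with the most positive tree votes) purely as an illustration. The function names, the `votes` data, and the tie-breaking by index are all assumptions made for this example.

```python
def positive_votes(votes):
    """Count, for each point, how many trees vote for class 1.

    votes[t][i] is the 0/1 prediction of tree t for point i (hypothetical layout).
    """
    n_points = len(votes[0])
    return [sum(tree[i] for tree in votes) for i in range(n_points)]


def majority_vote(votes):
    """Standard random-forest rule: class 1 if more than half of the trees say so."""
    n_trees = len(votes)
    return [1 if 2 * p > n_trees else 0 for p in positive_votes(votes)]


def cardinality_constrained_vote(votes, n_positive):
    """Greedy illustration (not the paper's MILP): label exactly n_positive points as 1.

    Points with the most positive votes are chosen first; ties are broken by
    index because Python's sort is stable.
    """
    counts = positive_votes(votes)
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    chosen = set(order[:n_positive])
    return [1 if i in chosen else 0 for i in range(len(counts))]


# Three trees, five unlabeled points (made-up votes).
votes = [
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
]

plain = majority_vote(votes)
constrained = cardinality_constrained_vote(votes, n_positive=2)
print(plain)        # only point 0 reaches a majority
print(constrained)  # exactly two points are labeled 1
```

In this example the majority vote labels a single point as positive, whereas the constrained variant is obliged to find a second one. The paper addresses the same requirement through an optimization model rather than a greedy ranking.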

Paper Structure

This paper contains 12 sections, 5 theorems, 35 equations, 3 figures, 4 tables, and 1 algorithm.

Key Result

Proposition 1

A valid big-$M$ for Problem \eqref{eq:Problem1} is given by $M = ut + 1$, i.e., $M$ is linear in the number of trees in the forest.

Figures (3)

  • Figure 1: ECDFs for run times (in seconds)
  • Figure 2: Comparison of $\overline{\text{AC}}$ (left) and $\overline{\text{MCC}}$ (right); see \ref{eq:comparing}
  • Figure 3: Comparison of $\overline{\text{AC}}$ (left) and $\overline{\text{MCC}}$ (right); see \ref{eq:comparing}

Theorems & Definitions (10)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • proof