An efficient, provably optimal algorithm for the 0-1 loss linear classification problem

Xi He; Max A. Little

TL;DR

The paper tackles exact training of the 0-1 loss linear classifier, an NP-hard problem, and introduces Incremental Cell Enumeration (ICE), an algorithm based on hyperplane arrangements and point-hyperplane duality. It proves a 0-1 loss linear classification theorem and extends the approach to polynomial hypersurfaces via Veronese embeddings, with worst-case time complexity $O(N^{D+1})$ for linear classifiers and $O(N^{G+1})$ for degree-$K$ hypersurfaces. Empirically, ICE attains optimal training accuracy on small to moderate datasets and often generalizes better than approximate methods, while running substantially faster than branch-and-bound baselines. The work advances exact, interpretable ML by bridging combinatorial geometry with optimization, and outlines paths to scalable parallel implementations and coreset-based strategies for larger problems.
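The Veronese embedding mentioned above can be illustrated with a minimal sketch. It assumes the embedding lists every non-constant monomial of total degree at most $K$, so that the embedded dimension is $G=\binom{D+K}{D}-1$; this formula for $G$ is an assumption made for illustration, and the function name `veronese_embed` is hypothetical rather than taken from the paper.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np


def veronese_embed(X, degree):
    """Map each row of X to all non-constant monomials of total degree <= degree.

    Assumption (not taken from the paper): the monomial ordering is by degree,
    then lexicographic in the variable indices.
    """
    X = np.asarray(X, dtype=float)
    n_dims = X.shape[1]
    columns = []
    for k in range(1, degree + 1):
        # Each multiset of k variable indices defines one monomial of degree k.
        for idx in combinations_with_replacement(range(n_dims), k):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


X = np.array([[1.0, 2.0], [3.0, 4.0]])
Z = veronese_embed(X, degree=2)  # columns: x1, x2, x1^2, x1*x2, x2^2
print(Z.shape[1], comb(2 + 2, 2) - 1)  # embedded dimension vs. assumed formula for G
```

Under this assumption, a degree-$K$ hypersurface in $\mathbb{R}^{D}$ becomes a hyperplane in $\mathbb{R}^{G}$, which is why a linear method can be reused on the embedded data.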

Abstract

Algorithms for solving the linear classification problem have a long history, dating back at least to 1936 with linear discriminant analysis. For linearly separable data, many algorithms can obtain the exact solution to the corresponding 0-1 loss classification problem efficiently, but for data which is not linearly separable, it has been shown that this problem, in full generality, is NP-hard. Alternative approaches all involve approximations of some kind, such as the use of surrogates for the 0-1 loss (for example, the hinge or logistic loss), none of which can be guaranteed to solve the problem exactly. Finding an efficient, rigorously proven algorithm for obtaining an exact (i.e., globally optimal) solution to the 0-1 loss linear classification problem remains an open problem. By analyzing the combinatorial and incidence relations between hyperplanes and data points, we derive a rigorous construction algorithm, incremental cell enumeration (ICE), that can solve the 0-1 loss classification problem exactly in $O(N^{D+1})$ time. To the best of our knowledge, this is the first standalone algorithm (one that does not rely on general-purpose solvers) with rigorously proven guarantees for this problem. Moreover, we further generalize ICE to address the polynomial hypersurface classification problem in $O(N^{G+1})$ time, where $G$ is determined by both the data dimension and the polynomial hypersurface degree. The correctness of our algorithm is proved using tools from the theory of hyperplane arrangements and oriented matroids. We demonstrate the effectiveness of our algorithm on real-world datasets, achieving optimal training accuracy for small-scale datasets and higher test accuracy on most datasets. Furthermore, our complexity analysis shows that the ICE algorithm offers superior computational efficiency compared with a state-of-the-art branch-and-bound algorithm.
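To make the problem concrete, the following is a naive enumeration sketch, not the authors' ICE algorithm. It assumes the data are in general position and relies on the standard observation that, in that case, an optimal hyperplane can be found among those passing through $D$ data points, with those $D$ points treated as correctly classifiable after a small perturbation. The function name `brute_force_min_01_loss` and the tolerance `tol` are hypothetical. It visits $\binom{N}{D}$ candidates and evaluates each in $O(N)$ time, which gives an intuition for an $O(N^{D+1})$ bound at fixed $D$.

```python
from itertools import combinations

import numpy as np


def brute_force_min_01_loss(X, y, tol=1e-9):
    """Smallest 0-1 loss over hyperplanes through D data points (both orientations).

    Assumptions (illustrative, not from the paper): data in general position,
    labels in {-1, +1}, and points lying on a candidate hyperplane are counted
    as correct because a small perturbation can place them on either side.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_points, n_dims = X.shape
    Xh = np.hstack([X, np.ones((n_points, 1))])  # homogeneous coordinates
    best = n_points
    for idx in combinations(range(n_points), n_dims):
        # The hyperplane through the chosen points is the null space of their
        # D x (D+1) homogeneous matrix: the last right-singular vector.
        _, _, vt = np.linalg.svd(Xh[list(idx)])
        margins = Xh @ vt[-1]
        off_plane = np.abs(margins) > tol
        for orientation in (1, -1):
            predicted = np.sign(orientation * margins[off_plane])
            best = min(best, int(np.sum(predicted != y[off_plane])))
    return best


# XOR-like labels: no line separates them, and the best achievable loss is 1.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([1, 1, -1, -1])
print(brute_force_min_01_loss(X_xor, y_xor))
```

This sketch returns only the optimal loss value; recovering an explicit classifier would additionally require perturbing the best candidate hyperplane so that the on-plane points fall on their correct sides.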

Paper Structure (17 sections, 17 theorems, 15 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Theory
Problem definition
Point configurations and hyperplane arrangements
Linear classification and point-hyperplane duality
Non-linear (polynomial hypersurface) classification
Incremental cell enumeration (ICE) algorithm
Empirical experiments
Exact linear (hyperplane) classification
Exact hypersurface (quadratic hypersurface) classification
Summary, discussion and future work
Proofs and definitions
Additional experiments
Run-time complexity analysis
Out-of-Sample Generalization Test
...and 2 more sections

Key Result

Theorem 1

Incidence relations of the dual transformation. Let $\boldsymbol{p}$ be a point and $h=\left\{ \boldsymbol{x}:\boldsymbol{w}^{T}\boldsymbol{x}=0\right\}$ a non-vertical affine hyperplane in $\mathbb{R}^{D}$. Under the dual transformation $\phi$, $\boldsymbol{p}$ and $h$ satisfy the following properties ...

Figures (6)

Figure 1: Novel theoretical contributions enabling the ICE algorithm: identifying the necessary and sufficient dual-arrangement faces that must be enumerated to solve the 0-1 loss linear classification problem (0-1 LCP). The black $\boldsymbol{\times}$ marks (unbounded cells) and red $\color{red}\boldsymbol{\times}$ marks (bounded cells) together represent all the cells of a dual arrangement, with $\left|\cdot\right|$ denoting the number of marks of each kind. The linear classification theorem shows that exhaustively enumerating all cells together with the reversed sign vectors of the bounded cells (total size $\left|\boldsymbol{\times}\right|+2\left| {\color{red}\boldsymbol{\times}} \right|$) yields a number exactly matching Cover's counting function $\mathit{Cover}$ for possible linear dichotomies (as proved in the Cover bound lemma). This procedure solves the linear classification problem for any objective function, filling the gap in Cover's theorem, which provides only a counting formula without specifying how to enumerate the dichotomies. The 0-1 loss linear classification theorem demonstrates that the 0-1 LCP can be solved exactly by exhaustively enumerating all blue circles in the figure and their corresponding reversed sign vectors, formally proving the correctness of the PCS algorithm of SIAM-v28-nguyen13a, which had only been empirically observed to be optimal. Finally, the symmetry fusion theorem shows that it suffices to enumerate only the blue circles, without their reversed signs, reducing the number of configurations and enabling the construction of our incremental cell enumeration (ICE) algorithm.
Figure 2: A point configuration $\mathcal{D}$ (left panel) and its dual arrangement $\mathcal{H_{D}}$ (right panel). The yellow hyperplanes $w_{4}$, $w_{5}$, each with two data points lying on it in $\mathbb{R}^{D}$, correspond to the yellow points $\phi\left(w_{4}\right)$, $\phi\left(w_{5}\right)$ in the dual space, which are intersections of the corresponding dual hyperplanes. For the (blue) hyperplanes $w_{1}$, $w_{2}$, $w_{3}$ with the same prediction labels $\left(+,+,-,-\right)$, the corresponding dual points $\phi\left(w_{1}\right)$, $\phi\left(w_{2}\right)$, $\phi\left(w_{3}\right)$ lie in the same cell of the dual arrangement $\phi\left(\mathcal{D}\right)$.
Figure 3: Optimal quadratic classifiers learned by the ICE algorithm (top four panels) achieve 0–1 losses of 9, 16, 17, and 16, while the approximate quadratic classifiers learned by an SVM with a degree-2 polynomial kernel (bottom four panels) obtain 0–1 losses of 17, 26, 21, and 22.
Figure 4: Log-log plot of wall-clock run time (seconds) of the ICE algorithm on synthetic datasets of dimension $D=1$ to $D=4$, against dataset size $N$, with the approximate upper bound disabled (by setting it to $N$). The run-time curves from left to right (corresponding to $D=1,2,3,4$ respectively) have slopes 2.0, 3.1, 4.1, and 4.9, closely matching the predicted worst-case run-time complexities of $O\left(N^{2}\right)$, $O\left(N^{3}\right)$, $O\left(N^{4}\right)$, and $O\left(N^{5}\right)$ respectively.
Figure 5: Log-linear plot of wall-clock run time (seconds) comparing the ICE algorithm against the branch-and-bound (BnB) algorithm of SIAM-v28-nguyen13a (MATLAB implementation provided by the authors) on three-dimensional synthetic data. On this log-linear scale, exponential run time appears as a linear function of problem size $N$, whereas polynomial run time appears as a logarithmic function of $N$. Fitting the appropriate models (lines) to the computational experiment data (dots) provides clear evidence for this prediction.
...and 1 more figure
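The slope comparison described in the Figure 4 caption can be reproduced in spirit with a least-squares fit in log-log space. The sketch below uses synthetic timings generated from an exact cubic law (not the paper's measurements), and the helper name `loglog_slope` is hypothetical. If run time scales as $c\,N^{p}$, then $\log t=\log c+p\log N$, so the fitted slope estimates the exponent $p$.

```python
import numpy as np


def loglog_slope(sizes, times):
    """Least-squares slope of log(times) against log(sizes)."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(times), deg=1)
    return slope


# Synthetic timings following t = c * N^3; these are NOT measurements from the paper.
sizes = np.array([50, 100, 200, 400])
times = 2e-7 * sizes.astype(float) ** 3
print(round(loglog_slope(sizes, times), 3))
```

With real measurements the fitted slope is only an empirical estimate; constant overheads at small $N$ can bias it, so fits are usually restricted to the larger problem sizes.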

Theorems & Definitions (27)

Definition 1
Theorem 1
Lemma 1
Lemma 2
Theorem 2
Lemma 3
Definition 2
Lemma 4
Lemma 5
Theorem 3
...and 17 more
