Data organization limits the predictability of binary classification
Fei Jing, Zi-Ke Zhang, Yi-Cheng Zhang, Qingpeng Zhang
TL;DR
This work develops a data-centric theory of binary classification boundaries, showing that the best attainable classifier performance, captured by $AR^u$, $AP^u$, and $AC^u$, depends solely on dataset characteristics rather than on the chosen model. By deriving closed-form optimal solutions for square, logistic, hinge, and softmax losses, and by formulating ensembles and ground-state energies, the authors connect objective optimization to fundamental data properties, including class overlap quantified by a Jensen-Shannon-based metric $D_S$. They demonstrate that the optimal ROC and PR curves share the same underlying classifier ranking, with per-sample scores tied to $\frac{\mathcal{P}(x)}{\mathcal{P}(x)+\mathcal{N}(x)}$, and they develop bounds and relationships under random divisions and feature-engineering scenarios. The results provide actionable guidance for feature selection and extraction and for data design by revealing how overlap and diversity govern the theoretical ceilings of predictive performance; analyses of many datasets validate the theory. In practice, the framework informs dataset construction and feature-engineering strategies for approaching the fundamental limits of binary classification.
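To make the score $\frac{\mathcal{P}(x)}{\mathcal{P}(x)+\mathcal{N}(x)}$ concrete, here is a minimal Python sketch on a toy dataset with discrete feature values. It assumes that $\mathcal{P}(x)$ and $\mathcal{N}(x)$ are the empirical class-conditional frequencies of the feature value $x$; that reading is an assumption of this example, as are the function names `optimal_scores` and `auc_of_ranking` and the toy data, none of which come from the paper. The AUC is computed by direct pair counting, with ties counted as one half.

```python
from collections import Counter


def optimal_scores(samples):
    """Map each feature value x to P(x) / (P(x) + N(x)).

    samples: list of (x, label) pairs, label 1 = positive, 0 = negative.
    P and N are taken to be empirical class-conditional frequencies
    (an assumption of this sketch).
    """
    pos = Counter(x for x, y in samples if y == 1)
    neg = Counter(x for x, y in samples if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    scores = {}
    for x in set(pos) | set(neg):
        p = pos[x] / n_pos  # P(x): share of positives located at x
        n = neg[x] / n_neg  # N(x): share of negatives located at x
        scores[x] = p / (p + n)
    return scores


def auc_of_ranking(samples, scores):
    """Area under the ROC curve by pair counting (a tie counts as 0.5)."""
    pos = [scores[x] for x, y in samples if y == 1]
    neg = [scores[x] for x, y in samples if y == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three feature values with overlapping classes.
toy = [("a", 1), ("a", 1), ("a", 0),
       ("b", 1), ("b", 0), ("b", 0),
       ("c", 0)]
toy_scores = optimal_scores(toy)
toy_auc = auc_of_ranking(toy, toy_scores)
print(toy_auc)  # 0.75
```

Because positives and negatives that share a feature value cannot be separated by any scoring function of $x$, no ranking of this toy data exceeds an AUC of 0.75. This illustrates how a ceiling can be fixed by the data rather than by the model.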
Abstract
Data organization is widely recognized as having a substantial influence on the efficacy of machine learning algorithms, particularly in binary classification tasks. We provide a theoretical framework suggesting that the maximum performance of binary classifiers on a given dataset is primarily constrained by the inherent qualities of the data. Through both theoretical reasoning and empirical examination, using standard objective functions, evaluation metrics, and binary classifiers, we arrive at two principal conclusions. First, we show that the theoretical upper bound of binary classification performance on real datasets can be attained. This upper bound represents a calculable equilibrium between the learning loss and the evaluation metric. Second, we compute the exact upper bounds for three commonly used evaluation metrics and find them consistent with our overarching thesis: the upper bound is determined by the dataset's characteristics and is independent of the classifier in use. Further analysis reveals a detailed relationship between the upper bound of performance and the level of class overlap within the binary classification data. This relationship is instrumental for identifying the most effective feature subsets in feature engineering.
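As a rough illustration of how class overlap might be quantified, the following Python sketch computes the standard Jensen-Shannon divergence (base 2) between the empirical feature distributions of the two classes. This is not necessarily the paper's overlap metric, whose exact definition is not given in this section; the function name `js_divergence` and the discrete-feature setting are assumptions of the example. A value of 0 means the two classes have identical feature distributions (complete overlap), and a value of 1 means their supports are disjoint (no overlap).

```python
import math
from collections import Counter


def js_divergence(samples):
    """Jensen-Shannon divergence (in bits) between class-conditional distributions.

    samples: list of (x, label) pairs, label 1 = positive, 0 = negative.
    Standard textbook definition; only an assumed proxy for class overlap.
    """
    pos = Counter(x for x, y in samples if y == 1)
    neg = Counter(x for x, y in samples if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    js = 0.0
    for x in set(pos) | set(neg):
        p = pos[x] / n_pos
        n = neg[x] / n_neg
        m = 0.5 * (p + n)  # mixture of the two class distributions at x
        if p > 0:
            js += 0.5 * p * math.log2(p / m)
        if n > 0:
            js += 0.5 * n * math.log2(n / m)
    return js


# Hypothetical examples: identical class distributions vs disjoint supports.
full_overlap = [("a", 1), ("b", 1), ("a", 0), ("b", 0)]
no_overlap = [("a", 1), ("b", 0)]
print(js_divergence(full_overlap), js_divergence(no_overlap))
```

Under this proxy, candidate feature subsets could be compared by how far they push the divergence toward 1, in line with the abstract's point that lower class overlap raises the attainable performance.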
