Data organization limits the predictability of binary classification

Fei Jing, Zi-Ke Zhang, Yi-Cheng Zhang, Qingpeng Zhang

TL;DR

This work develops a data-centric theory of the limits of binary classification, showing that the best attainable classifier performance, captured by $\text{AR}^u$, $\text{AP}^u$, and $\text{AC}^u$, depends solely on dataset characteristics rather than on the chosen model. By deriving closed-form optimal solutions for square, logistic, hinge, and softmax losses, and by formulating ensembles and ground-state energies, the authors connect objective optimization to fundamental data properties, including class overlap quantified by a Jensen-Shannon-based metric $D_{\mathcal{S}}$. They demonstrate that the optimal ROC and PR curves share the same underlying classifier ranking, with per-sample scores tied to $\frac{\mathcal{P}(x)}{\mathcal{P}(x)+\mathcal{N}(x)}$, and they develop bounds and relationships under random divisions and feature-engineering scenarios. The results provide actionable guidance for feature selection, feature extraction, and data design by revealing how overlap and diversity govern the theoretical ceilings of predictive performance, and extensive dataset analyses validate the theory. Practically, this framework informs dataset construction and feature-engineering strategies for approaching the fundamental limits of binary classification.

Abstract

The structure of data organization is widely recognized as having a substantial influence on the efficacy of machine learning algorithms, particularly in binary classification tasks. Our research provides a theoretical framework suggesting that the maximum potential of binary classifiers on a given dataset is primarily constrained by the inherent qualities of the data. Through both theoretical reasoning and empirical examination, we employed standard objective functions, evaluative metrics, and binary classifiers to arrive at two principal conclusions. Firstly, we show that the theoretical upper bound of binary classification performance on actual datasets can in fact be attained. This upper bound represents a calculable equilibrium between the learning loss and the evaluation metric. Secondly, we have computed the precise upper bounds for three commonly used evaluation metrics, uncovering a fundamental consistency with our overarching thesis: the upper bound is intricately linked to the dataset's characteristics, independent of the classifier in use. Additionally, our subsequent analysis uncovers a detailed relationship between the upper limit of performance and the level of class overlap within the binary classification data. This relationship is instrumental for pinpointing the most effective feature subsets for use in feature engineering.

Paper Structure (32 sections, 4 theorems, 122 equations, 61 figures, 4 tables)

Key Result

Lemma 1

Upon the inclusion of a new feature into the original dataset, $\text{AR}^u$ will either increase or remain constant, while $D_{\mathcal{S}}$ will either decrease or stay the same. These values will remain unchanged if, and only if, the diversity $s_i=1$ for each $x_i$.
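The monotonicity in Lemma 1 can be checked on a toy dataset. The sketch below is not the paper's formula. It assumes that $\text{AR}^u$ is the largest AUC reachable when samples with identical feature vectors must receive the same score, and that this maximum is reached by ranking each distinct $x$ by $\frac{\mathcal{P}(x)}{\mathcal{P}(x)+\mathcal{N}(x)}$, with tied pairs counted as one half. The function name `auc_upper_bound` and the toy data are hypothetical, and $D_{\mathcal{S}}$ and the diversity $s_i$ are not modeled.

```python
from collections import defaultdict
from fractions import Fraction


def auc_upper_bound(X, y):
    """Largest AUC reachable when identical feature vectors must share a score.

    Assumes labels are the integers 0 (negative) and 1 (positive) and that
    both classes occur at least once.
    """
    # counts[x] = [number of negatives at x, number of positives at x]
    counts = defaultdict(lambda: [0, 0])
    for x, label in zip(X, y):
        counts[tuple(x)][label] += 1

    # Score each distinct x by P(x) / (P(x) + N(x)) and pool equal scores,
    # because equally scored samples are tied in the ranking.
    by_score = defaultdict(lambda: [0, 0])
    for n_neg, n_pos in counts.values():
        score = Fraction(n_pos, n_pos + n_neg)
        by_score[score][0] += n_neg
        by_score[score][1] += n_pos

    total_neg = sum(v[0] for v in by_score.values())
    total_pos = sum(v[1] for v in by_score.values())

    # AUC = P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)
    wins = Fraction(0)
    neg_below = 0  # negatives with a strictly lower score
    for score in sorted(by_score):
        n_neg, n_pos = by_score[score]
        wins += n_pos * neg_below + Fraction(n_pos * n_neg, 2)
        neg_below += n_neg
    return wins / (total_pos * total_neg)


y = [0, 0, 1, 0, 1, 1]
X_base = [[0], [0], [0], [1], [1], [1]]          # one overlapping feature
X_const = [row + [7] for row in X_base]          # added feature carries no information
X_split = [[0, 0], [0, 0], [0, 1], [1, 0], [1, 1], [1, 1]]  # added feature separates classes

print(auc_upper_bound(X_base, y))   # 2/3
print(auc_upper_bound(X_const, y))  # 2/3
print(auc_upper_bound(X_split, y))  # 1
```

In this toy case the bound never decreases when a feature is added: the constant column leaves it at 2/3, while the informative column raises it to 1. Exact `Fraction` arithmetic is used so that equal scores are detected reliably as ties.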

Figures (61)

  • Figure S1: Exact upper bound of AUC and corresponding optimal ROC curves for four real-world datasets when $|\mathcal{S}_{train}|/|\mathcal{S}|=0.1$ (A), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.2$ (B), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.3$ (C), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.4$ (D), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.5$ (E), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.6$ (F), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.7$ (G), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.8$ (H), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.9$ (I), $|\mathcal{S}_{train}|/|\mathcal{S}|=1$ (J). The binary classifiers used in this experiment are XGBoost, MLP, SVM, Logistic Regression, Decision Tree, Random Forest, KNN, and Naive Bayes. Red curves represent the theoretical optimal ROC curves.
  • Figure S2: Exact upper bound of AP and corresponding optimal PR curves for four real-world datasets when $|\mathcal{S}_{train}|/|\mathcal{S}|=0.1$ (A), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.2$ (B), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.3$ (C), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.4$ (D), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.5$ (E), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.6$ (F), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.7$ (G), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.8$ (H), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.9$ (I), $|\mathcal{S}_{train}|/|\mathcal{S}|=1$ (J). The binary classifiers used in this experiment are XGBoost, MLP, SVM, Logistic Regression, Decision Tree, Random Forest, KNN, and Naive Bayes. Red curves represent the theoretical optimal PR curves.
  • Figure S3: The loss errors of four datasets in training ($\Delta_{train}^{f}$) and test sets ($\Delta_{test}^{f}$) when $|\mathcal{S}_{train}|/|\mathcal{S}|=0.1$ (A), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.2$ (B), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.3$ (C), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.4$ (D), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.5$ (E), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.6$ (F), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.7$ (G), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.8$ (H), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.9$ (I). The dashed line represents the expected error of the optimal classifier based on Eq. \ref{min_delta}.
  • Figure S4: The loss errors of four datasets in training ($\Delta_{train}^{f}$) and test sets ($\Delta_{test}^{f}$) when $|\mathcal{S}_{train}|/|\mathcal{S}|=0.1$ (A), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.2$ (B), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.3$ (C), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.4$ (D), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.5$ (E), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.6$ (F), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.7$ (G), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.8$ (H), $|\mathcal{S}_{train}|/|\mathcal{S}|=0.9$ (I). The gray line represents the expected error of the optimal classifier based on Eq. \ref{min_delta}.
  • Figure S5: The loss errors of four datasets (AID, HED, INE and SUD) in training ($\Delta_{train}^{f}$) and test sets ($\Delta_{test}^{f}$) for different binary classifiers, including XGBoost with four classical objectives (A-D), MLP (E), SVM (F), Logistic Regression (G), Decision Tree (H), Random Forest (I), and KNN (J). Colored dots and lines represent different values of $|\mathcal{S}_{train}|/|\mathcal{S}|$, ranging from $0.1$ to $0.9$.
  • ...and 56 more figures

Theorems & Definitions (8)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof