A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools

Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A Fukuyama, Ashia C. Wilson

TL;DR

We adopt a consequentialist perspective from decision theory to argue that evaluation methods should prioritize forecast quality across thresholds and base rates. We introduce a decision-theoretic framework that maps evaluation metrics to their appropriate use cases, together with a practical Python package that lowers the barrier to applying proper scoring rules.

Abstract

Machine learning-supported decisions, such as ordering diagnostic tests or determining preventive custody, often require converting probabilistic forecasts into binary classifications. We adopt a consequentialist perspective from decision theory to argue that evaluation methods should prioritize forecast quality across thresholds and base rates. This motivates the use of proper scoring rules such as the Brier score and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on top-K metrics or fixed-threshold evaluations. To bridge this disconnect, we introduce a decision-theoretic framework that maps evaluation metrics to their appropriate use cases, accompanied by a practical Python package, briertools, which lowers the barrier to applying proper scoring rules in practice. Methodologically, we derive and implement a clipped Brier score variant that avoids full integration and better reflects bounded, interpretable threshold ranges. Theoretically, we reconcile the Brier score with decision curve analysis, directly addressing the critique of Assel et al. (2017) regarding the clinical utility of proper scoring rules.

Paper Structure

This paper contains 62 sections, 29 theorems, 70 equations, 7 figures, 2 tables.

Key Result

Theorem 2.1

Given a calibrated model, the optimal threshold is the cost.
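The result can be checked numerically with a minimal sketch. It assumes a cost normalization that is not stated in this overview: a false positive costs c and a false negative costs 1 - c. The function names and the synthetic probabilities are illustrative only and do not come from the paper or its package.

```python
import numpy as np

def expected_cost(p, threshold, c):
    """Mean expected cost of acting whenever p > threshold.

    Assumption (not taken from the paper): a false positive costs c and a
    false negative costs 1 - c, and p is a calibrated probability.
    """
    p = np.asarray(p, dtype=float)
    act = p > threshold
    # Acting is wrong with probability 1 - p (cost c);
    # not acting is wrong with probability p (cost 1 - c).
    cost = np.where(act, (1.0 - p) * c, p * (1.0 - c))
    return float(cost.mean())

def best_threshold(p, c, grid):
    """Grid value with the lowest expected cost (first one on ties)."""
    costs = [expected_cost(p, t, c) for t in grid]
    return float(grid[int(np.argmin(costs))])

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=20_000)  # synthetic calibrated forecasts
grid = np.linspace(0.0, 1.0, 101)
c = 0.3
t_star = best_threshold(p, c, grid)
print(f"cost c = {c}, best threshold on grid = {t_star:.2f}")
```

Under this normalization, acting costs (1 - p)c in expectation and not acting costs p(1 - c), so acting is cheaper exactly when p > c. The grid search only confirms that pointwise argument on sampled forecasts.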

Figures (7)

  • Figure 1: Claude 3.5 Haiku was used to analyze 2,610 papers from three major 2024 conferences. Each plot summarizes the evaluation metrics used for binary classifiers. Accuracy dominates outside healthcare, while AUC-ROC is more prevalent within healthcare domains. Error bars come from binomial confidence intervals.
  • Figure 2: A comparison of breast cancer prediction models' performance over the range of commonly suggested thresholds for treatment.
  • Figure 3: A. ROC plot shows XGBoost has worse discrimination than Logistic Regression models as recommended by yang22. B. Log Loss Curve shows XGBoost has better average regret. C. Decomposition reconciles the two; XGBoost has much better calibration, and only slightly worse discrimination.
  • Figure 4: The figure shows the DCA (A), which can be rescaled so that for an interval of cost ratios, the area above the curve and below the prevalence $\pi$ is equal to the bounded threshold Brier score (B) or bounded threshold log loss (C).
  • Figure 5: Where the tradeoff between false positives and false negatives is one-to-one, accuracy is a good metric. Where trade-offs are generally moderate, Brier score is a good metric. Where trade-offs are frequently extreme, log loss is better.
  • ...and 2 more figures

Theorems & Definitions (55)

  • Theorem 2.1: Optimal Threshold
  • Definition 2.2: Accuracy
  • Proposition 2.3
  • Theorem 2.4: Brier Score as Uniform Mixture of Regret
  • Theorem 2.5: Log Loss as a Weighted Average of Regret
  • Definition 3.1: Net Benefit (DCA)
  • Theorem 3.2: Net Benefit as a function of regret
  • Theorem 3.3: Bounded Threshold Brier Score
  • Theorem 3.4: Bounded Threshold Log Loss
  • Remark 3.5: Bounded Threshold AUC-ROC
  • ...and 45 more
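The statement of Theorem 2.4 is not reproduced in this overview, but the standard identity behind it can be sketched numerically. The sketch assumes that a false positive costs c and a false negative costs 1 - c, measures loss against a perfect classifier (whose loss is zero), and uses a factor of 2 so that the average matches the usual Brier score. The paper's own normalization may differ, and all function names here are illustrative rather than part of briertools.

```python
import numpy as np

def cost_weighted_loss(y, p, c):
    """Mean loss when acting iff p > c (FP costs c, FN costs 1 - c; assumed)."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    act = p > c
    false_pos = act & (y == 0)
    false_neg = ~act & (y == 1)
    return float(np.mean(c * false_pos + (1.0 - c) * false_neg))

def brier_score(y, p):
    """Usual Brier score: mean squared error of the forecast."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))

def brier_via_thresholds(y, p, n_grid=2000):
    """Approximate 2 * integral over c in [0, 1] of the cost-weighted loss."""
    cs = (np.arange(n_grid) + 0.5) / n_grid  # midpoint rule on [0, 1]
    losses = np.array([cost_weighted_loss(y, p, c) for c in cs])
    return float(2.0 * losses.mean())

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=500)           # synthetic forecasts
y = (rng.uniform(size=500) < p).astype(int)   # outcomes drawn to be calibrated
direct = brier_score(y, p)
mixture = brier_via_thresholds(y, p)
print(f"direct = {direct:.4f}, threshold mixture = {mixture:.4f}")
```

The two numbers agree up to the discretization error of the threshold grid. For a positive case the integral of (1 - c) over c >= p is (1 - p)^2 / 2, and for a negative case the integral of c over c < p is p^2 / 2, which is where the factor of 2 comes from.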