A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools

Gerardo Flores; Abigail Schiff; Alyssa H. Smith; Julia A Fukuyama; Ashia C. Wilson

TL;DR

We adopt a consequentialist perspective from decision theory to argue that evaluation methods should prioritize forecast quality across thresholds and base rates. We introduce a decision-theoretic framework that maps evaluation metrics to their appropriate use cases, along with a practical Python package that lowers the barrier to applying proper scoring rules.

Abstract

Machine learning-supported decisions, such as ordering diagnostic tests or determining preventive custody, often require converting probabilistic forecasts into binary classifications. We adopt a consequentialist perspective from decision theory to argue that evaluation methods should prioritize forecast quality across thresholds and base rates. This motivates the use of proper scoring rules such as the Brier score and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on top-K metrics or fixed-threshold evaluations. To bridge this disconnect, we introduce a decision-theoretic framework that maps evaluation metrics to their appropriate use cases, accompanied by a practical Python package, briertools, which lowers the barrier to applying proper scoring rules in practice. Methodologically, we derive and implement a clipped Brier score variant that avoids full integration and better reflects bounded, interpretable threshold ranges. Theoretically, we reconcile the Brier score with decision curve analysis, directly addressing the critique of Assel et al. (2017) regarding the clinical utility of proper scoring rules.
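For readers unfamiliar with the two proper scoring rules named in the abstract, the following is a minimal sketch of their standard textbook definitions in plain Python. It does not use or reproduce the briertools API; the function names `brier_score` and `log_loss`, the clipping constant `eps`, and the toy data are assumptions made for illustration only.

```python
import math

def brier_score(y_true, y_prob):
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the forecasts.

    Probabilities are clipped to [eps, 1 - eps] (an illustrative choice)
    so that a forecast of exactly 0 or 1 does not produce log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Toy data (hypothetical): four cases with binary outcomes and forecasts.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]

informative_brier = brier_score(y_true, y_prob)
informative_log_loss = log_loss(y_true, y_prob)

# Baseline: always forecasting the base rate of 0.5.
baseline_brier = brier_score(y_true, [0.5] * len(y_true))
```

Both scores are lower-is-better and are computed from the probabilities directly, so no classification threshold has to be fixed in advance. On this toy data the informative forecasts score better than the constant base-rate forecast.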
