Rethinking and Recomputing the Value of Machine Learning Models

Burcu Sayin, Jie Yang, Xinyue Chen, Andrea Passerini, Fabio Casati

TL;DR

This paper tackles the mismatch between traditional ML evaluation and real-world value by introducing a value-based framework for selective classification in hybrid human-AI workflows. It defines a concrete value metric, $V(g,\mathcal{D})$, that accounts for the rejection option and downstream costs, and derives thresholding and calibration strategies to maximize value. Through NLP-based experiments across hate-speech detection, clickbait, and multi-domain sentiment tasks, the authors show that accuracy and F1-score are poor proxies for value, calibration strongly influences value, and out-of-distribution settings can erode model value unless properly calibrated. The work highlights the practical impact of choosing models not by conventional metrics but by their expected operational value, guiding deployment decisions by incorporating deferral, costs, and human-in-the-loop considerations.

Abstract

In this paper, we argue that the prevailing approach to training and evaluating machine learning models often fails to consider their real-world application within organizational or societal contexts, where they are intended to create beneficial value for people. We propose a shift in perspective, redefining model assessment and selection to emphasize integration into workflows that combine machine predictions with human expertise, particularly in scenarios requiring human intervention for low-confidence predictions. Traditional metrics like accuracy and F-score fail to capture the beneficial value of models in such hybrid settings. To address this, we introduce a simple yet theoretically sound "value" metric that incorporates task-specific costs for correct predictions, errors, and rejections, offering a practical framework for real-world evaluation. Through extensive experiments, we show that existing metrics fail to capture real-world needs, often leading to suboptimal choices in terms of value when used to rank classifiers. Furthermore, we emphasize the critical role of calibration in determining model value, showing that simple, well-calibrated models can often outperform more complex models that are challenging to calibrate.
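
To make the idea of a cost-aware "value" metric concrete, the following is a minimal sketch of one way such a quantity could be computed for a classifier with a confidence-threshold reject option. It is not the paper's exact definition of $V(g,\mathcal{D})$: the function name `empirical_value`, the parameter names, and the default cost values are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def empirical_value(
    confidences: Sequence[float],
    is_correct: Sequence[bool],
    threshold: float,
    v_correct: float = 1.0,  # assumed reward for an accepted correct prediction
    c_wrong: float = 4.0,    # assumed cost of an accepted wrong prediction
    c_reject: float = 0.5,   # assumed cost of deferring an item to a human
) -> float:
    """Average per-item value under an assumed three-outcome cost model.

    Items whose confidence is below `threshold` are rejected (deferred to a
    human); all other items are accepted and scored as correct or wrong.
    """
    if len(confidences) != len(is_correct):
        raise ValueError("confidences and is_correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one item is required")

    total = 0.0
    for conf, ok in zip(confidences, is_correct):
        if conf < threshold:
            total -= c_reject   # rejected: pay the deferral cost
        elif ok:
            total += v_correct  # accepted and correct: earn the reward
        else:
            total -= c_wrong    # accepted and wrong: pay the error cost
    return total / len(confidences)


# Toy data: two confident predictions (one wrong) and two uncertain ones.
confs = [0.9, 0.8, 0.6, 0.55]
correct = [True, False, True, False]
print(empirical_value(confs, correct, threshold=0.0))   # accept everything
print(empirical_value(confs, correct, threshold=0.85))  # accept only the top item
```

Under these assumed costs, the toy classifier has 50% accuracy regardless of the threshold on accepted-plus-rejected items, yet its value changes substantially with the rejection policy, which is the kind of gap between accuracy and value the abstract describes.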

Paper Structure (16 sections, 15 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: A typical implementation of ML models into an ML solution workflow involves using a rejection function that filters predictions based on a confidence threshold. This approach generally assumes that the classifier is trained independently of the rejection logic. However, this is not a necessity: the classifier can be designed to be aware of the associated costs, which may make it less "general" but more tailored to specific needs.
  • Figure 2: Common approaches to selectivity in classification: (a) filtering predictions based on a confidence threshold, (b) employing an input-based selector model to decide on prediction acceptance, (c) using a confidence recalibrator followed by threshold-based filtering, or (d) incorporating built-in abstention with an 'I don't know' class.
  • Figure 3: Adapting selective classifiers to maximize value: (a) threshold-based selector, (b) cost-sensitive threshold-based selector, (c) recalibrator + threshold-based selector. Changes with respect to standard counterparts are highlighted in red.
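
The threshold-based selectors named in the Figure 3 caption can be illustrated with a small sketch. It assumes a simple cost model that is not quoted from the paper: an accepted correct prediction earns `v_correct`, an accepted wrong prediction costs `c_wrong`, and a rejection costs `c_reject`. All function and parameter names are hypothetical. The first function picks the threshold that maximizes value on held-out data; the second gives the threshold implied by the same cost model if confidences were perfectly calibrated, obtained by accepting whenever $p \cdot v - (1-p) \cdot c_w \ge -c_r$.

```python
from typing import Sequence, Tuple


def best_threshold(
    confidences: Sequence[float],
    is_correct: Sequence[bool],
    v_correct: float,
    c_wrong: float,
    c_reject: float,
) -> Tuple[float, float]:
    """Sweep candidate thresholds and return (threshold, average value).

    An item is accepted when its confidence is >= the threshold. Candidates
    are the distinct observed confidences plus +inf (reject everything).
    """
    if len(confidences) != len(is_correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    candidates = sorted(set(confidences)) + [float("inf")]
    best_t, best_v = candidates[0], float("-inf")
    for t in candidates:
        total = 0.0
        for conf, ok in zip(confidences, is_correct):
            if conf < t:
                total -= c_reject    # deferred to a human
            elif ok:
                total += v_correct   # accepted, correct
            else:
                total -= c_wrong     # accepted, wrong
        value = total / len(confidences)
        if value > best_v:           # ties keep the lower threshold
            best_t, best_v = t, value
    return best_t, best_v


def calibrated_threshold(v_correct: float, c_wrong: float, c_reject: float) -> float:
    """Acceptance threshold under the assumed cost model and perfect calibration.

    Solving p * v_correct - (1 - p) * c_wrong >= -c_reject for p gives
    p >= (c_wrong - c_reject) / (v_correct + c_wrong).
    """
    return (c_wrong - c_reject) / (v_correct + c_wrong)


confs = [0.9, 0.8, 0.6, 0.55]
correct = [True, False, True, False]
print(best_threshold(confs, correct, v_correct=1.0, c_wrong=4.0, c_reject=0.5))
print(calibrated_threshold(v_correct=1.0, c_wrong=4.0, c_reject=0.5))
```

When the two thresholds disagree noticeably on validation data, that is one sign the confidences are miscalibrated, which is where a recalibration step before thresholding would come in.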