Perils of Label Indeterminacy: A Case Study on Prediction of Neurological Recovery After Cardiac Arrest

Jakob Schoeffer, Maria De-Arteaga, Jonathan Elmer

TL;DR

The paper introduces label indeterminacy, defined as the presence of unknowable ground-truth labels for a subset of instances, which forces unverifiable or arbitrary choices in label construction. Through a cardiac-arrest recovery case study, it shows that ten different label-construction approaches yield similar predictive performance on cases with known outcomes but produce substantially different predictions for cases with indeterminate labels, including potential reversals in clinical recommendations. The study demonstrates striking disagreement (about 19.6% of indeterminate cases) and altered ranking of patients across models, highlighting ethical and evaluative challenges when decisions have irreversible consequences. It argues that current evaluation focused on known-label performance can mask critical multiplicity and advocates sociotechnical design changes and clearer reporting to account for indeterminacy and to better support high-stakes human decision-making.

Abstract

The design of AI systems to assist human decision-making typically requires the availability of labels to train and evaluate supervised models. Frequently, however, these labels are unknown, and different ways of estimating them involve unverifiable assumptions or arbitrary choices. In this work, we introduce the concept of label indeterminacy and derive important implications in high-stakes AI-assisted decision-making. We present an empirical study in a healthcare context, focusing specifically on predicting the recovery of comatose patients after resuscitation from cardiac arrest. Our study shows that label indeterminacy can result in models that perform similarly when evaluated on patients with known labels, but vary drastically in their predictions for patients where labels are unknown. After demonstrating crucial ethical implications of label indeterminacy in this high-stakes context, we discuss takeaways for evaluation, reporting, and design.

Paper Structure

This paper contains 39 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Whether a label can be obtained depends on whether a patient is withdrawn from life-sustaining therapies (WLST). WLST patients die, so we cannot know whether they would have recovered had therapy continued; their labels are therefore unknown.
  • Figure 2: Data setup and taxonomy of our empirical study. The whole patient population is divided into patients with (blue, left) and without (orange, right) known labels. The subset of patients with known labels is $N$, and the label indeterminacy set (i.e., WLST patients) is $W$. For each WLST patient $j \in W$, we have three expert assessments, $e_j^1,e_j^2,e_j^3 \in [0,1]$. Dashed borders indicate that the corresponding values are not ground truth.
  • Figure 3: ROC curves with confidence bounds from 5-fold cross validation on patients with known labels. Shapes are nearly indistinguishable across models.
  • Figure 4: Distribution of predictions on holdout sets of WLST cases. Distributions are strikingly different across models, but we have no way of knowing which predictions are "best" in light of label indeterminacy.
  • Figure 5: Predictions for two WLST patients with strong disagreement between models. This disagreement is entirely due to unverifiable design choices made in estimating and incorporating labels for WLST patients.
  • ...and 2 more figures
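
The setup in Figure 2 can be illustrated with a minimal sketch. It assumes each WLST patient $j \in W$ has three expert assessments $e_j^1, e_j^2, e_j^3 \in [0,1]$, as the caption states. Everything else is hypothetical: the patient IDs, the assessment values, the four aggregation rules, and the 0.5 threshold are illustrative assumptions and are not the ten label-construction approaches studied in the paper. The sketch only shows how equally defensible rules can assign different labels to the same patient, and how the share of patients affected can be counted.

```python
from statistics import mean, median

# Hypothetical expert assessments (e_j^1, e_j^2, e_j^3) for WLST patients j in W.
# Values are made up for illustration; they are not data from the paper.
expert_assessments = {
    "patient_a": (0.10, 0.20, 0.60),
    "patient_b": (0.70, 0.80, 0.90),
    "patient_c": (0.05, 0.10, 0.15),
    "patient_d": (0.40, 0.55, 0.70),
}

THRESHOLD = 0.5  # assumed cutoff for turning an aggregate score into a binary label

# Illustrative label-construction rules (assumptions, not the paper's approaches).
RULES = {
    "mean": lambda e: int(mean(e) >= THRESHOLD),
    "median": lambda e: int(median(e) >= THRESHOLD),
    "optimistic": lambda e: int(max(e) >= THRESHOLD),   # most favorable expert
    "pessimistic": lambda e: int(min(e) >= THRESHOLD),  # least favorable expert
}


def construct_labels(assessments, rules):
    """Return {rule_name: {patient_id: label}} for every rule."""
    return {
        name: {pid: rule(e) for pid, e in assessments.items()}
        for name, rule in rules.items()
    }


def disagreement_rate(labels_by_rule):
    """Fraction of patients whose label differs across at least two rules."""
    patient_ids = next(iter(labels_by_rule.values())).keys()
    disagreeing = sum(
        1
        for pid in patient_ids
        if len({labels[pid] for labels in labels_by_rule.values()}) > 1
    )
    return disagreeing / len(patient_ids)


labels = construct_labels(expert_assessments, RULES)
rate = disagreement_rate(labels)
print(f"Patients with conflicting constructed labels: {rate:.0%}")
```

In this toy data, the rules agree on the patients whose three assessments fall on the same side of the threshold and conflict on the others. None of the rules can be validated against ground truth, because the true outcome for a WLST patient is unknowable; that is the point the figures make for the trained models.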