Perils of Label Indeterminacy: A Case Study on Prediction of Neurological Recovery After Cardiac Arrest
Jakob Schoeffer, Maria De-Arteaga, Jonathan Elmer
TL;DR
The paper introduces label indeterminacy: for a subset of instances the ground-truth label is unknowable, which forces unverifiable or arbitrary choices in label construction. In a case study on recovery after cardiac arrest, ten different label-construction approaches yield similar predictive performance on cases with known outcomes but substantially different predictions for cases with indeterminate labels, including potential reversals in clinical recommendations. The models disagree on about 19.6% of indeterminate cases and rank patients differently, highlighting ethical and evaluative challenges when decisions have irreversible consequences. The paper argues that current evaluation, which focuses on known-label performance, can mask this multiplicity, and it advocates sociotechnical design changes and clearer reporting to account for indeterminacy and to better support high-stakes human decision-making.
Abstract
The design of AI systems to assist human decision-making typically requires the availability of labels to train and evaluate supervised models. Frequently, however, these labels are unknown, and different ways of estimating them involve unverifiable assumptions or arbitrary choices. In this work, we introduce the concept of label indeterminacy and derive important implications in high-stakes AI-assisted decision-making. We present an empirical study in a healthcare context, focusing specifically on predicting the recovery of comatose patients after resuscitation from cardiac arrest. Our study shows that label indeterminacy can result in models that perform similarly when evaluated on patients with known labels, but vary drastically in their predictions for patients where labels are unknown. After demonstrating crucial ethical implications of label indeterminacy in this high-stakes context, we discuss takeaways for evaluation, reporting, and design.
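The central observation, that models can agree on known-label cases yet diverge on indeterminate ones, can be illustrated with a minimal sketch. Everything below is hypothetical: the two model names, the toy scores, the 0.5 decision threshold, and the definition of disagreement (a case counts as disputed if any two models give different binary predictions) are assumptions made for illustration, not the paper's data or necessarily its exact metric.

```python
def binarize(scores, threshold=0.5):
    """Turn predicted recovery probabilities into 0/1 predictions.
    The 0.5 threshold is an illustrative assumption."""
    return [1 if s >= threshold else 0 for s in scores]

def accuracy(preds, labels):
    """Share of predictions that match the known labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def disagreement_rate(predictions_by_model):
    """Fraction of cases on which at least two models predict differently."""
    all_preds = list(predictions_by_model.values())
    n_cases = len(all_preds[0])
    n_disputed = sum(
        1 for i in range(n_cases) if len({preds[i] for preds in all_preds}) > 1
    )
    return n_disputed / n_cases

# Toy data (entirely synthetic): two models trained on differently
# constructed labels, scored on known and indeterminate cases.
known_labels = [1, 0, 1, 0]
known_scores = {
    "model_a": [0.9, 0.1, 0.8, 0.2],
    "model_b": [0.7, 0.3, 0.6, 0.4],
}
indeterminate_scores = {
    "model_a": [0.9, 0.2, 0.7, 0.4, 0.1],
    "model_b": [0.8, 0.6, 0.3, 0.45, 0.2],
}

# Performance on known-label cases: both models look identical.
known_acc = {
    name: accuracy(binarize(scores), known_labels)
    for name, scores in known_scores.items()
}

# Predictions on indeterminate cases: the models diverge.
rate = disagreement_rate(
    {name: binarize(scores) for name, scores in indeterminate_scores.items()}
)

print(known_acc)  # both models reach accuracy 1.0 on the known cases
print(rate)       # 0.4: two of the five indeterminate cases are disputed
```

In this toy setup, an evaluation restricted to `known_labels` would not distinguish the two models, while the disagreement rate on the indeterminate cases exposes the difference. Reporting both quantities side by side is one way to make such multiplicity visible.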
