Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?

Joris Baan, Raquel Fernández, Barbara Plank, Wilker Aziz

TL;DR

This position paper analyzes two interpretations of predictive probabilities in NLP: model confidence (uncertainty about model error, assessed via calibration) and human label variation (uncertainty about outcomes arising from differing annotator perspectives). It argues that both are essential for trustworthy and fair NLP, but that a single predictive distribution cannot capture both. The authors review the calibration and human-annotation perspectives, discuss their merits and limitations, and connect them to aleatoric and epistemic uncertainty. They propose directions for disentangled uncertainty representations, such as separate error-prediction modules, Bayesian and conformal methods, and improved annotator data practices, to enhance both trust and fairness in NLP systems.

Abstract

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good representation of uncertainty by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.
