Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework
Christopher Klugmann, Daniel Kondermann
TL;DR
This paper addresses the problem of quantifying aleatoric uncertainty in categorical annotations by introducing an ambiguity measure amb(q) that explicitly accounts for a "can't solve" option and for annotator disagreement. Grounded in quadratic entropy, amb(q) separates uncertainty about solvability from uncertainty due to category indistinguishability, and is extended by a normalized variant \tilde{amb}(q) for consistent interpretation across settings. The authors develop a statistical framework under a Dirichlet–multinomial model, deriving closed-form moments for amb and \tilde{amb} and outlining posterior sampling to quantify epistemic uncertainty about ambiguity. They compare their measures to a literature baseline amb_0, analyze the bias and consistency of frequentist plug-in estimators, and illustrate practical utility for dataset quality assessment, stratified benchmarking, and active learning. The framework provides actionable, probability-valued signals for annotator agreement and task difficulty, while remaining interpretable and extensible to richer models in future work.
Abstract
Human-generated categorical annotations frequently produce empirical response distributions (soft labels) that reflect ambiguity rather than simple annotator error. We introduce an ambiguity measure that maps a discrete response distribution to a scalar in the unit interval, designed to quantify aleatoric uncertainty in categorical tasks. The measure bears a close relationship to quadratic entropy (Gini-style impurity) but departs from those indices by treating an explicit "can't solve" category asymmetrically, thereby separating uncertainty arising from class-level indistinguishability from uncertainty due to explicit unresolvability. We analyze the measure's formal properties and contrast its behavior with a representative ambiguity measure from the literature. Moving beyond description, we develop statistical tools for inference: we propose frequentist point estimators for population ambiguity and derive the Bayesian posterior over ambiguity induced by Dirichlet priors on the underlying probability vector, providing a principled account of epistemic uncertainty. Numerical examples illustrate estimation, calibration, and practical use for dataset-quality assessment and downstream machine-learning workflows.
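The inference workflow described in the abstract (a plug-in point estimate plus a Dirichlet posterior over a scalar functional of the response distribution) can be illustrated with a minimal sketch. The abstract does not state the formula for the proposed ambiguity measure, so this sketch uses plain quadratic entropy, 1 - sum_i q_i^2, as a stand-in; it does not treat the "can't solve" category asymmetrically and is not the authors' measure. The vote counts, the symmetric prior with alpha = 1, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def quadratic_entropy(q):
    """Gini-style impurity 1 - sum_i q_i^2 along the last axis.

    Stand-in functional only; NOT the paper's amb(q).
    """
    q = np.asarray(q, dtype=float)
    return 1.0 - np.sum(q ** 2, axis=-1)


def posterior_samples(counts, alpha=1.0, n_samples=10_000, seed=0):
    """Sample the functional under the conjugate posterior Dir(alpha + counts)."""
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    q = rng.dirichlet(alpha + counts, size=n_samples)  # shape (n_samples, K)
    return quadratic_entropy(q)


def posterior_mean_closed_form(counts, alpha=1.0):
    """Exact posterior mean of the stand-in, from Dirichlet second moments:
    E[q_i^2] = a_i (a_i + 1) / (a_0 (a_0 + 1)) with a = alpha + counts."""
    a = alpha + np.asarray(counts, dtype=float)
    a0 = a.sum()
    return 1.0 - np.sum(a * (a + 1.0)) / (a0 * (a0 + 1.0))


# Hypothetical votes for one item: class A, class B, "can't solve".
counts = np.array([6, 3, 1])

plug_in = quadratic_entropy(counts / counts.sum())   # frequentist plug-in estimate
draws = posterior_samples(counts)                    # Monte Carlo posterior draws
lo, hi = np.quantile(draws, [0.025, 0.975])          # 95% credible interval

print(f"plug-in estimate:       {plug_in:.3f}")
print(f"posterior mean (MC):    {draws.mean():.3f}")
print(f"posterior mean (exact): {posterior_mean_closed_form(counts):.3f}")
print(f"95% credible interval:  [{lo:.3f}, {hi:.3f}]")
```

The width of the credible interval reflects epistemic uncertainty from having only ten votes, while the value of the functional itself describes the aleatoric spread of responses. Swapping in a different functional for `quadratic_entropy` leaves the sampling step unchanged, although the closed-form mean above applies only to this stand-in.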
