Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework

Christopher Klugmann, Daniel Kondermann

TL;DR

This paper tackles the challenge of quantifying aleatoric uncertainty in categorical annotations by introducing a novel ambiguity measure $\mathrm{amb}(\vb{q})$ that explicitly accounts for a "can't solve" option and annotator disagreement. Grounded in quadratic entropy, $\mathrm{amb}(\vb{q})$ partitions uncertainty into solvability and category indistinguishability, and is extended by a normalized variant $\widetilde{\mathrm{amb}}(\vb{q})$ for consistent interpretation across settings. The authors develop a complete statistical framework under a Dirichlet–multinomial model, deriving closed-form moments for $\mathrm{amb}$ and $\widetilde{\mathrm{amb}}$ and outlining posterior sampling to quantify epistemic uncertainty about ambiguity. They compare their measures to a literature baseline $\mathrm{amb}_0$, demonstrate estimation bias and consistency of frequentist plug-in estimators, and illustrate practical utility for dataset quality assessment, stratified benchmarking, and active learning. The framework provides actionable, probability-valued signals for annotator agreement and task difficulty, while remaining interpretable and extensible to richer models in future work.

Abstract

Human-generated categorical annotations frequently produce empirical response distributions (soft labels) that reflect ambiguity rather than simple annotator error. We introduce an ambiguity measure that maps a discrete response distribution to a scalar in the unit interval, designed to quantify aleatoric uncertainty in categorical tasks. The measure bears a close relationship to quadratic entropy (Gini-style impurity) but departs from those indices by treating an explicit "can't solve" category asymmetrically, thereby separating uncertainty arising from class-level indistinguishability from uncertainty due to explicit unresolvability. We analyze the measure's formal properties and contrast its behavior with a representative ambiguity measure from the literature. Moving beyond description, we develop statistical tools for inference: we propose frequentist point estimators for population ambiguity and derive the Bayesian posterior over ambiguity induced by Dirichlet priors on the underlying probability vector, providing a principled account of epistemic uncertainty. Numerical examples illustrate estimation, calibration, and practical use for dataset-quality assessment and downstream machine-learning workflows.
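
The Bayesian step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the definition of $\mathrm{amb}$ is not reproduced on this page, so the plain quadratic entropy (Gini impurity) $1 - \sum_i q_i^2$ is used as a stand-in functional, and the counts, the symmetric prior parameter `beta`, and all function names are assumptions made for the example. With a Dirichlet prior and multinomial counts, the posterior over $\vb{q}$ is again Dirichlet, so the posterior over any scalar functional can be approximated by sampling.

```python
import numpy as np


def gini_impurity(q):
    """Quadratic entropy 1 - sum_i q_i^2 along the last axis.

    Stand-in functional only; it is NOT the paper's amb measure.
    """
    q = np.asarray(q, dtype=float)
    return 1.0 - np.sum(q * q, axis=-1)


def posterior_samples(counts, beta=1.0, n_samples=10_000, seed=0):
    """Draw samples of the functional under a Dirichlet posterior.

    A symmetric Dirichlet(beta) prior combined with multinomial counts
    gives a Dirichlet(counts + beta) posterior over the probability vector.
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    q_draws = rng.dirichlet(counts + beta, size=n_samples)
    return gini_impurity(q_draws)


# Hypothetical example: 20 responses over two classes plus "can't solve".
counts = np.array([9, 7, 4])
plug_in = gini_impurity(counts / counts.sum())
draws = posterior_samples(counts)
lo, hi = np.quantile(draws, [0.025, 0.975])

print(f"plug-in estimate: {plug_in:.3f}")
print(f"posterior mean:   {draws.mean():.3f}")
print(f"95% interval:     [{lo:.3f}, {hi:.3f}]")
```

The width of the interval reflects epistemic uncertainty from the finite number of responses; swapping `gini_impurity` for another functional of the probability vector leaves the sampling step unchanged.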

Paper Structure

This paper contains 28 sections, 3 theorems, 75 equations, 7 figures, 1 table.

Key Result

Lemma 1

Let $\widehat{\mathrm{amb}}_n = \mathrm{amb}(\hat{\vb{q}}_n)$ denote the plug-in estimator of ambiguity based on a sample of size $n$, where $\mathrm{amb}(\cdot)$ is the ambiguity functional defined in Equation eq:def_amb_final, and $\hat{\vb{q}}_n$ is the empirical distribution of the observed labels [...] where $\vb{q}$ is the true underlying probability vector. Then, for all non-degenerate $\vb{q}$, th...

Figures (7)

  • Figure 1: A labeling task is performed by a group of annotators (the crowd), resulting in a distribution over possible answers. In this example, the distribution is derived from $20$ individual responses, reflecting high uncertainty regarding a single correct label. This observed response distribution embodies both epistemic uncertainty---due to the finite sample of annotations---and aleatoric uncertainty, which captures the intrinsic, irreducible uncertainty linked to the task and the annotators. The latter is what we define as ambiguity, the central focus of this work. Using Bayesian inference, we estimate the posterior distribution over possible ambiguity values, quantifying how compatible each is with the observed data.
  • Figure 2: Eight examples of dichotomous distributions (including the cs category) with varying degrees of ambiguity (left). The table (right) shows the ambiguity values of distributions (1) - (8) for the new ($\mathrm{amb}$), old ($\mathrm{amb}_0$), and modified ($\widetilde{\mathrm{amb}}$) ambiguity measures.
  • Figure 3: Eight examples of categorical distributions (including the cs category) with varying degrees of ambiguity (left). The table (right) shows the ambiguity values of distributions (1) - (8) for the new ($\mathrm{amb}$), old ($\mathrm{amb}_0$), and modified ($\widetilde{\mathrm{amb}}$) ambiguity measures.
  • Figure 4: Example of a Dirichlet distribution over $C=3$ categories, shown in the left half of the image as a heatmap over the standard simplex. The right half of the image displays the corresponding univariate distribution of the normalized entropy for this distribution.
  • Figure 5: Bias of different estimators of ambiguity as a function of the number of observed answers. Shown are the plug-in estimator and Bayesian estimators (posterior mean and mode) with symmetric Dirichlet priors parameterized by $\beta \in \{0.5, 1.0\}$. Each panel corresponds to a different underlying categorical distribution $\vb{q}_0$, as indicated in the titles. The left panel shows a relatively balanced distribution over two proper categories and one residual cant_solve category. In contrast, the right panel reflects a highly skewed distribution with most mass concentrated on a single category. In all cases, the estimators are biased but consistent: the bias decreases with increasing sample size. For small numbers of answers, Bayesian estimators are visibly influenced by the prior. Notably, the plug-in estimator underestimates the true ambiguity in expectation and converges from below.
  • ...and 2 more figures
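
The behavior described in the Figure 5 caption (a plug-in estimator that is biased downward but consistent) can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses the plain quadratic entropy $1 - \sum_i q_i^2$ as a stand-in for $\mathrm{amb}$, and the distribution `q0`, the sample sizes, and the function names are hypothetical. For this stand-in the bias is known exactly, $-\left(1 - \sum_i q_i^2\right)/n$, which gives a reference value for the Monte Carlo estimate.

```python
import numpy as np


def gini_impurity(q):
    """Quadratic entropy 1 - sum_i q_i^2 along the last axis.

    Stand-in functional only; it is NOT the paper's amb measure.
    """
    q = np.asarray(q, dtype=float)
    return 1.0 - np.sum(q * q, axis=-1)


def plug_in_bias(q0, n, n_rep=20_000, seed=0):
    """Monte Carlo estimate of E[g(q_hat_n)] - g(q0) for n answers."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, q0, size=n_rep)  # shape (n_rep, len(q0))
    estimates = gini_impurity(counts / n)
    return estimates.mean() - gini_impurity(q0)


# Hypothetical distribution: two proper categories plus a residual category.
q0 = np.array([0.45, 0.45, 0.10])

for n in (5, 10, 20, 50, 100):
    simulated = plug_in_bias(q0, n)
    exact = -gini_impurity(q0) / n  # exact bias of the quadratic-entropy plug-in
    print(f"n={n:4d}  simulated bias={simulated:+.4f}  exact bias={exact:+.4f}")
```

The simulated bias is negative at every sample size and shrinks toward zero as `n` grows, which matches the qualitative pattern of underestimation and convergence from below.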

Theorems & Definitions (6)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3: Asymptotics at $a\to1^{-}$
  • proof