Your Model is Overconfident, and Other Lies We Tell Ourselves

Timothee Mickus, Aman Sinha, Raúl Vázquez

TL;DR

This work investigates how data complexity, annotator disagreement, and model uncertainty interact in NLP. By applying 11 indicators across 29 models on ChaosNLI and DynaSent, it shows that human-based assessments of difficulty do not map linearly to model-based metrics, and that reference-free indicators can conflate model success and failure. Conformal prediction and entropy-based measures partially align with human variation, but gaps remain, especially when models primarily agree with non-consensus labels. The study underscores the need to disentangle sources of uncertainty, advocates soft-label training, and calls for replication across tasks and languages to better quantify data complexity and its impact on evaluation and model improvement.

Abstract

The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.
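
As a rough illustration of how two of the indicator families named above can diverge, the sketch below compares the entropy of an annotator label distribution (a human-based measure of dissensus) with the entropy of a model's predictive distribution (a reference-free measure of model confidence). It is a minimal sketch under stated assumptions, not the paper's implementation: the label counts, the model probabilities, and the helper names `entropy` and `annotator_distribution` are all hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def annotator_distribution(labels, classes):
    """Turn a list of individual annotator labels into relative frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


classes = ["entailment", "neutral", "contradiction"]

# Hypothetical item: 100 annotators disagree noticeably about the label.
annotations = ["entailment"] * 60 + ["neutral"] * 35 + ["contradiction"] * 5

# Hypothetical model output: a confident prediction for a non-majority label.
model_probs = [0.05, 0.93, 0.02]

human_dist = annotator_distribution(annotations, classes)
human_entropy = entropy(human_dist)   # high: annotators disagree
model_entropy = entropy(model_probs)  # low: the model is confident

majority_label = classes[max(range(len(classes)), key=human_dist.__getitem__)]
model_label = classes[max(range(len(classes)), key=model_probs.__getitem__)]

print(f"human entropy: {human_entropy:.3f}, model entropy: {model_entropy:.3f}")
print(f"majority label: {majority_label}, model label: {model_label}")
```

In this made-up case the model looks confident by a reference-free measure while siding with a minority of annotators, which is the kind of mismatch that a single confidence score cannot distinguish from a confident correct prediction.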

Paper Structure

This paper contains 37 sections, 17 equations, 1 figure, 12 tables.

Figures (1)

  • Figure 1: Example of joint distribution between a reference-free and a reference-dependent indicator (SNLI 1B pool, $\mathbb{M}_{\mathrm{CP},\ \alpha=0.05}$ vs. $\mathbb{M}_{1^\mathrm{st}\ \mathrm{layer}}^\mathrm{ref}$). Datapoints in orange are misclassified by $50\%$ of the pool; blue datapoints are not. See also the partitioned model correlation tables for SNLI and MNLI (appendix, supplementary results: breakdown along success).