Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information

Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta

TL;DR

The paper introduces a principled framework for understanding NLP dataset difficulty through $\mathcal{V}$-usable information, $I_\mathcal{V}(X \to Y)$, and its per-instance analogue, pointwise $\mathcal{V}$-information (pvi), which quantify how much information about the labels is usable by a model family $\mathcal{V}$. It shows that different datasets offer different amounts of usable information, and that pvi can identify easy and hard instances and mislabelled examples, and, combined with input transformations and slicing, dataset artefacts. Empirically, model performance tracks $I_\mathcal{V}$, overfitting is detected more sensitively by $I_\mathcal{V}$ than by held-out accuracy, and pvi estimates are stable across models, seeds, and epochs, aligning with human difficulty judgments in many cases. The framework also provides practical tools for artefact discovery (token-level signals, attribute-based slices) and paves the way for dataset design and cross-domain extensions of NLP benchmarks.

Abstract

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model $\mathcal{V}$ -- as the lack of $\mathcal{V}$-$\textit{usable information}$ (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce $\textit{pointwise }\mathcal{V}\textit{-information}$ (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, $\mathcal{V}$-$\textit{usable information}$ and PVI also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.
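
As a rough illustration of how the two quantities relate, the sketch below computes PVI in bits as the log-probability that a model assigns to the gold label given the input, minus the log-probability assigned by a model from the same family given a null (empty) input, and then estimates $\mathcal{V}$-usable information as the mean PVI over held-out instances. The function names and the probabilities are hypothetical and chosen only for illustration; in practice the two probabilities would come from two separately fine-tuned models.

```python
import math


def pvi(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information in bits for one instance.

    p_with_input: probability of the gold label from the model that sees x.
    p_null_input: probability of the gold label from the model that sees
                  only a null input (hypothetical values in this sketch).
    """
    return math.log2(p_with_input) - math.log2(p_null_input)


def v_information_estimate(pairs) -> float:
    """Estimate V-usable information as the mean PVI over held-out pairs."""
    values = [pvi(p_x, p_null) for p_x, p_null in pairs]
    return sum(values) / len(values)


# Made-up gold-label probabilities: (with input, with null input).
held_out = [(1.0, 0.5), (0.25, 0.5), (0.5, 0.5)]
per_instance = [pvi(p_x, p_null) for p_x, p_null in held_out]
dataset_level = v_information_estimate(held_out)
```

A negative PVI, as for the second pair, means the model did worse with the input than it would have by relying on the label prior alone, which is the kind of instance the paper flags as hard or possibly mislabelled.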

Paper Structure

This paper contains 42 sections, 7 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: The Stanford NLI dataset contains more BERT-usable information than the MultiNLI and CoLA datasets, making it easier for BERT-base. Above, the distribution of instance difficulty (pvi) in the held-out sets for each; dotted lines denote the average pvi.
  • Figure 2: Comparing the $\mathcal{V}$-usable information estimate to accuracy in SNLI. In the first three epochs, estimates on the test set are similar across all models (top), but due to over-fitting, the estimates diverge and decline. The test accuracy (bottom) for each model loosely tracks the $\mathcal{V}$-information estimate for that model, since extracting information makes prediction easier.
  • Figure 3: The distribution of pvi for correctly and incorrectly predicted instances in each dataset. Note that the point at which instances start being incorrectly predicted is similar across datasets ($\sim$ 0.5 bits). In contrast, because the label space is different across CoLA and the other two datasets, such a comparison could not be made with a performance-based metric such as accuracy.
  • Figure 4: The amount of $\mathcal{V}$-usable information contained in different input attributes about the gold labels in SNLI. The token identity alone (regardless of order) provides most of the information for all models (see shuffled). The premise, which can be shared by multiple instances, is useless alone; the hypothesis, which is unique to an instance, is quite useful even without a premise, suggesting it may contain annotation artefacts.
  • Figure 5: The mean pvi of SNLI instances according to BERT-base, broken down by the overlap length (i.e., the number of tokens shared by the hypothesis and premise). Entailment examples with no overlap are the most difficult (i.e., lowest mean pvi).
  • ...and 8 more figures
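
The slicing behind Figure 5 can be sketched as grouping instances by overlap length and averaging pvi within each group. This is a minimal sketch under stated assumptions, not the authors' implementation: tokens are lowercased and split on whitespace, overlap is counted over distinct tokens, and the example sentences and pvi values are made up.

```python
from collections import defaultdict


def overlap_length(premise: str, hypothesis: str) -> int:
    """Number of distinct tokens shared by premise and hypothesis.

    Assumption: lowercased whitespace tokenization stands in for the
    tokenizer used in the paper.
    """
    return len(set(premise.lower().split()) & set(hypothesis.lower().split()))


def mean_pvi_by_overlap(instances):
    """Map each overlap length to the mean pvi of the instances in that slice."""
    buckets = defaultdict(list)
    for premise, hypothesis, value in instances:
        buckets[overlap_length(premise, hypothesis)].append(value)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Hypothetical (premise, hypothesis, pvi) triples.
examples = [
    ("a dog runs", "a dog sleeps", 1.0),
    ("a cat sits", "the bird flies", -0.5),
    ("a man walks", "a man talks", 0.5),
]
slices = mean_pvi_by_overlap(examples)
```

The same grouping works for any instance attribute (label, length, presence of a token), which is what makes pvi convenient for comparing slices of one dataset.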

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 3.1: Pointwise $\mathcal{V}$-Information