Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information
Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
TL;DR
The paper introduces a principled framework for understanding NLP dataset difficulty through $\mathcal{V}$-usable information, $I_\mathcal{V}(X \to Y)$, and its per-instance analogue, pointwise $\mathcal{V}$-information (PVI). Both quantify how much information about the labels is usable by a model family $\mathcal{V}$. The paper shows that datasets differ in how much usable information they offer, and that PVI can identify easy and hard instances, mislabelled examples, and dataset artefacts via input transformations and slicing. Empirically, model performance tracks $I_\mathcal{V}$, overfitting is detected more sensitively by $I_\mathcal{V}$ than by held-out accuracy, and PVI estimates are stable across models, seeds, and epochs, aligning with human difficulty judgments in many cases. The framework also provides practical tools for artefact discovery (token-level signals, attribute-based slices) and can inform dataset design and extensions beyond NLP benchmarks.
Abstract
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty, with respect to a model $\mathcal{V}$, as the lack of $\mathcal{V}$-$\textit{usable information}$ (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce $\textit{pointwise $\mathcal{V}$-information}$ (PVI) for measuring the difficulty of individual instances with respect to a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, $\mathcal{V}$-$\textit{usable information}$ and PVI also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework makes it possible to interpret the contribution of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely used NLP benchmarks.
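To make the two quantities concrete, here is a minimal Python sketch. The abstract does not spell out the formulas, so the sketch assumes the usual reading of the framework: PVI for an instance is the log-probability (in bits) that a model trained on the real inputs assigns to the gold label, minus the log-probability assigned by a model from the same family trained on a null (empty) input, and $I_\mathcal{V}(X \to Y)$ is estimated as the mean PVI over held-out instances. The function names and the toy probabilities are hypothetical and serve only as illustration.

```python
import math
from typing import Sequence


def pvi(p_gold_with_input: float, p_gold_null: float) -> float:
    """Pointwise V-information of one instance, in bits.

    p_gold_with_input: probability of the gold label from a model trained on (x, y).
    p_gold_null: probability of the gold label from a model trained on (null, y).
    Both probabilities are assumed to be strictly positive.
    """
    return math.log2(p_gold_with_input) - math.log2(p_gold_null)


def v_information(p_with_input: Sequence[float], p_null: Sequence[float]) -> float:
    """Estimate I_V(X -> Y) as the average PVI over held-out instances."""
    if len(p_with_input) != len(p_null) or not p_with_input:
        raise ValueError("need two non-empty sequences of equal length")
    scores = [pvi(a, b) for a, b in zip(p_with_input, p_null)]
    return sum(scores) / len(scores)


# Toy example (made-up probabilities for three held-out instances).
with_input = [0.9, 0.5, 0.1]   # model that sees the input
null_input = [0.5, 0.5, 0.5]   # model that sees only a null input
per_instance = [pvi(a, b) for a, b in zip(with_input, null_input)]
dataset_level = v_information(with_input, null_input)
```

Under these assumptions, a high PVI marks an instance that is easy for the model family, a PVI of zero means the input adds nothing beyond the label prior, and a negative PVI flags an instance the input-aware model gets more wrong than the null model, which is where mislabelled examples tend to surface.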
