Data Checklist: On Unit-Testing Datasets with Usable Information

Heidi C. Zhang; Shabnam Behzad; Kawin Ethayarajh; Dan Jurafsky

TL;DR

This work introduces data checklists, a principled, information-theoretic framework based on $\mathcal{V}$-information to audit datasets for artifacts before model evaluation. By mapping dataset questions to a taxonomy of 10 unit tests and providing a practical library for sequencing outputs, the authors uncover both known and novel artifacts in language tasks and preference alignment data. They demonstrate that pointwise and conditional $\mathcal{V}$-information quantify per-instance difficulty and feature dependence, enabling effective data filtering that improves learning efficiency and performance with less data. The findings show artifacts such as premise-hypothesis overlap, length-based biases, and profanity signals in various datasets, and illustrate how PVIs can guide targeted data curation to enhance alignment and safety in language models.
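For reference, the quantities named in this summary can be written out as follows. This is a sketch based on the standard definitions in the $\mathcal{V}$-information literature that the paper builds on, not a quotation from this page. The notation is an assumption: $f[\varnothing]$ is a model given no input, and $g'$ and $g$ are the models from the family $\mathcal{V}$ fit without and with the input, respectively.

```latex
\begin{align*}
H_{\mathcal{V}}(Y) &= \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[\varnothing](Y)\right] \\
H_{\mathcal{V}}(Y \mid X) &= \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[X](Y)\right] \\
I_{\mathcal{V}}(X \to Y) &= H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X) \\
\mathrm{PVI}(x \to y) &= -\log_2 g'[\varnothing](y) + \log_2 g[x](y) \\
I_{\mathcal{V}}(X \to Y \mid \Phi) &= H_{\mathcal{V}}(Y \mid \Phi) - H_{\mathcal{V}}(Y \mid \Phi, X)
\end{align*}
```

Under these definitions, $\mathcal{V}$-information is the expectation of PVI over the data distribution, which is why a per-instance score and a dataset-level test can share the same two fitted models.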

Abstract

Model checklists (Ribeiro et al., 2020) have emerged as a useful tool for understanding the behavior of LLMs, analogous to unit-testing in software engineering. However, despite datasets being a key determinant of model behavior, evaluating datasets, e.g., for the existence of annotation artifacts, is largely done ad hoc, once a problem in model behavior has already been found downstream. In this work, we take a more principled approach to unit-testing datasets by proposing a taxonomy based on the $\mathcal{V}$-information literature. We call a collection of such unit tests a data checklist. Using a checklist, not only are we able to recover known artifacts in well-known datasets such as SNLI, but we also discover previously unknown artifacts in preference datasets for LLM alignment. Data checklists further enable a new kind of data filtering, which we use to improve the efficacy and data efficiency of preference alignment.
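To make the idea of a dataset unit test concrete, here is a minimal Python sketch. It assumes that each test compares an estimated $\mathcal{V}$-information value against a threshold $\epsilon$, and that the estimate is the mean per-example log-probability gain (in bits) of one fitted model over a baseline model on held-out data. The function names, the probability lists, and the exact pass conditions are hypothetical illustrations, not the authors' library. The default $\epsilon = 0.01$ is taken from the Figure 2 caption.

```python
import math
from statistics import mean


def pointwise_gain(p_full: float, p_baseline: float) -> float:
    """Bits gained on one example: log2 p_full(y) - log2 p_baseline(y)."""
    return math.log2(p_full) - math.log2(p_baseline)


def estimate_v_information(p_full, p_baseline) -> float:
    """Average the pointwise gains over held-out examples."""
    return mean(pointwise_gain(a, b) for a, b in zip(p_full, p_baseline))


def applicability_test(v_info_of_feature: float, epsilon: float = 0.01) -> bool:
    """Assumed pass condition: the feature alone carries usable information."""
    return v_info_of_feature > epsilon


def sufficiency_test(cond_v_info: float, epsilon: float = 0.01) -> bool:
    """Assumed pass condition: the full input adds (almost) nothing beyond the feature."""
    return cond_v_info <= epsilon


# Hypothetical held-out probabilities assigned to the gold label by three models:
p_null = [0.5, 0.5, 0.5, 0.5]        # model that sees no input
p_feature = [0.7, 0.6, 0.8, 0.5]     # model that sees only the feature Phi(x)
p_full_input = [0.9, 0.6, 0.8, 0.7]  # model that sees the whole input x

applicable = applicability_test(estimate_v_information(p_feature, p_null))
sufficient = sufficiency_test(estimate_v_information(p_full_input, p_feature))
print(f"applicable={applicable}, sufficient={sufficient}")
```

With these made-up numbers the feature is applicable but not sufficient, the same qualitative pattern the Figure 2 caption reports for profanity on HH-harmless.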

Paper Structure (27 sections, 5 equations, 4 figures, 5 tables)

Introduction
Background
$\mathcal{V}$-information.
Pointwise difficulty.
Conditional $\mathcal{V}$-information.
Data Checklists
Dataset Artifacts
Rediscovering Known Artifacts
Premise-hypothesis overlap in SNLI.
Lexical bias in hate speech detection.
Discovering Novel Artifacts in Preference Datasets
Response length.
Word complexity.
Profane words.
Multilinguality in UltraFeedback.
...and 12 more sections

Figures (4)

Figure 1: PVI values from the applicability and sufficiency tests on SHP with $\Phi$ as the length difference (i.e., how much longer the preferred output is compared to the dispreferred one). Green dots are where length alone makes the correct prediction and blue dots are where it does not. Length-based prediction is correct when the applicability PVI is relatively high and when the sufficiency PVI $\approx 0$, which happens more often when the length difference is greater.
Figure 2: Applicability and insufficiency tests both pass for $\Phi_\text{profanity}$ on HH-harmless ($\epsilon=0.01$). These imply that profane words are useful for aligning models to be safer (i.e., applicable), but that they are not enough on their own (i.e., insufficient). This is a desirable outcome: if profanity were inapplicable for HH-harmless, it would suggest that an important kind of harmful speech was not covered; if it were sufficient, it would suggest that HH-harmless had a simplistic view of harm.
Figure 3: DPO reward accuracy on UltraFeedback examples with different PVI intervals (12k examples per interval). Mid and mid-to-high-PVI examples lead to the highest reward accuracies, while the lowest-PVI examples are the least useful. This suggests that overly easy-to-learn and hard-to-learn examples contribute less to generalization than their more temperate counterparts.
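The interval-based comparison described for Figure 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: examples are sorted by a precomputed PVI score and split into equal-sized intervals, and the helper name, the toy scores, and the choice of which interval to keep are all hypothetical.

```python
def split_into_pvi_intervals(examples, pvis, n_intervals):
    """Sort examples by PVI (ascending) and cut them into n_intervals buckets.

    Buckets have equal size except the last, which absorbs any remainder.
    """
    order = sorted(range(len(examples)), key=lambda i: pvis[i])
    size = len(order) // n_intervals
    buckets = []
    for k in range(n_intervals):
        start = k * size
        end = (k + 1) * size if k < n_intervals - 1 else len(order)
        buckets.append([examples[i] for i in order[start:end]])
    return buckets


# Toy data: six examples with made-up PVI scores.
examples = ["a", "b", "c", "d", "e", "f"]
pvis = [0.5, -1.0, 2.0, 0.1, 1.2, -0.3]

low, mid, high = split_into_pvi_intervals(examples, pvis, n_intervals=3)
print(low, mid, high)
```

A filtering step would then train on one or more chosen buckets (for example, `mid`) and compare downstream metrics across buckets, which is the kind of comparison the caption summarizes.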