Benchmark Transparency: Measuring the Impact of Data on Evaluation

Venelin Kovatchev, Matthew Lease

TL;DR

Benchmark Transparency introduces an automated, multi-dimensional framework that quantifies how the distribution of data across six dimensions affects NLP model evaluation. Using disproportional stratified sampling and bootstrapped baselines on SQUAD and MNLI, the study shows that data distribution can cause substantial changes in both absolute and relative performance, often exceeding the effect of changing the metric. It further proposes a dataset similarity vector based on Standardized Mean Differences (SMD) and demonstrates that a simple linear model can predict out-of-domain (OOD) performance from in-domain observations. The work offers scalable data-centric evaluation tools, reports that the dimensions are empirically independent, and points toward more reliable and transparent NLP benchmarks, with practical implications for model diagnostics and benchmark design.

Abstract

In this paper we present exploratory research on quantifying the impact that data distribution has on the performance and evaluation of NLP models. We propose an automated framework that measures the data point distribution across 6 different dimensions: ambiguity, difficulty, discriminability, length, noise, and perplexity. We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance. We experiment on 2 different datasets (SQUAD and MNLI) and test a total of 135 different models (125 on SQUAD and 10 on MNLI). We demonstrate that without explicit control of the data distribution, standard evaluation frameworks are inconsistent and unreliable. We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric. In a second set of experiments, we demonstrate that the impact of data on evaluation is not just observable, but also predictable. We propose to use benchmark transparency as a method for comparing datasets and quantifying the similarity between them. We find that the "dataset similarity vector" can be used to predict how well a model generalizes out of distribution.
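
To make the sampling step in the abstract concrete, here is a minimal sketch of disproportional stratified sampling over a single dimension. It is an illustration under stated assumptions, not the authors' implementation: the function name, the equal-frequency binning, and sampling with replacement are all hypothetical choices made for this example.

```python
import random


def disproportional_stratified_sample(scores, weights, sample_size, seed=0):
    """Return item indices whose strata shares follow `weights`.

    scores:  one feature value per test item (e.g. a difficulty score)
    weights: target share of the sample for each stratum, lowest scores first
    Assumption: strata are equal-frequency bins over the ranked scores.
    """
    rng = random.Random(seed)
    n_bins = len(weights)
    # Rank items by score, then cut the ranking into n_bins contiguous strata.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [
        order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        for b in range(n_bins)
    ]
    sample = []
    for stratum, weight in zip(strata, weights):
        k = round(weight * sample_size)
        # Assumption: sample with replacement, so a stratum can be
        # over-represented relative to its share of the original test set.
        sample.extend(rng.choices(stratum, k=k))
    return sample


# Toy usage: skew a 100-item test set toward its highest-scoring third.
toy_scores = [i / 100 for i in range(100)]
skewed = disproportional_stratified_sample(toy_scores, [0.1, 0.2, 0.7], 50)
```

Evaluating every model on `skewed` instead of the original test set, and comparing the resulting change against bootstrapped random re-samples of the same size, gives the kind of per-dimension effect the abstract describes.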

Paper Structure (39 sections, 6 figures, 4 tables, 2 algorithms)

Figures (6)

  • Figure 1: The impact of data distribution on model F1. We report the $\delta$ in F1 caused by re-sampling the test set across each dimension. We report the mean $\delta$ of 125 models on SQUAD. We include random baseline and the impact of changing the "metric" from F1 to "exact".
  • Figure 2: Comparing datasets using benchmark transparency. We measure the data distribution and obtain a "dataset similarity vector". The vector can successfully predict the out-of-distribution change of model performance.
  • Figure 3: Normalized data distribution of all six dimensions for SQUAD and MultiNLI
  • Figure 4: Impact of different data features on model performance (F1) for SQUAD and MultiNLI. On each sub-figure we plot the aggregated change in F1 of all different models (colored shape) as we increase the feature intensity (e.g., as instances become more difficult). The gray region represents the expected random variance at p < 0.05.
  • Figure 5: The average SMD between the full SQUAD dataset and different subsets by topic. The dotted lines show the average SMD between SQUAD and random uniform sub-samples of itself at sizes of 5%, 10%, and 20%.
  • ...and 1 more figures
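
The Standardized Mean Difference (SMD) used in Figures 2 and 5 can be sketched as follows. This is a hedged illustration: it uses the common definition of SMD with a pooled standard deviation that averages the two sample variances, which may differ from the exact variant in the paper, and the names `smd` and `similarity_vector` are hypothetical.

```python
import math
from statistics import mean, variance


def smd(a, b):
    """Standardized mean difference between two samples of one dimension.

    Assumption: the pooled standard deviation is the square root of the
    average of the two sample variances.
    """
    pooled_sd = math.sqrt((variance(a) + variance(b)) / 2)
    if pooled_sd == 0:
        return 0.0  # both samples are constant; treat them as identical
    return (mean(a) - mean(b)) / pooled_sd


def similarity_vector(dataset_a, dataset_b, dimensions):
    """One SMD per dimension; each dataset maps a dimension name to its scores."""
    return [smd(dataset_a[d], dataset_b[d]) for d in dimensions]


# Toy usage with two of the six dimensions and made-up scores.
dims = ["difficulty", "length"]
in_domain = {"difficulty": [1, 2, 3], "length": [10, 20, 30]}
out_domain = {"difficulty": [2, 3, 4], "length": [10, 20, 30]}
vector = similarity_vector(in_domain, out_domain, dims)
```

A vector close to zero in every dimension indicates that the two datasets are distributed similarly along the measured dimensions; vectors of this kind are what the paper's linear model takes as input when predicting out-of-distribution performance.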