Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber

TL;DR

We introduce a systematic methodology for retrospectively constructing a holdout dataset for a target dataset, demonstrating that this retro-holdout dataset is statistically indistinguishable from the target, and comparing LLMs on the two datasets to quantify the performance gap attributable to the target dataset's public availability.

Abstract

The training data for many Large Language Models (LLMs) is contaminated with test data. This means that public benchmarks used to assess LLMs are compromised, suggesting a performance gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the statistical indistinguishability of this retro-holdout dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset's public availability. Applying these methods to TruthfulQA, we construct and release Retro-Misconceptions, on which we evaluate twenty LLMs and find that some have inflated scores by as much as 16 percentage points. Our results demonstrate that public benchmark scores do not always accurately assess model properties, and underscore the importance of improved data practices in the field.
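
As a rough illustration of step (iii), the sketch below computes a model's accuracy gap between a public benchmark and its retro-holdout, together with a two-sided Fisher's exact test on the correct/incorrect counts (the test named in the caption of Figure 5). The function names, the example counts, and the binomial 1-sigma error formula are assumptions made for illustration; they are not taken from the paper's released code.

```python
from math import comb, sqrt


def fisher_exact_two_sided(correct_a, total_a, correct_b, total_b):
    """Two-sided Fisher's exact test p-value for a 2x2 correct/incorrect table."""
    col_correct = correct_a + correct_b
    denom = comb(total_a + total_b, col_correct)

    def prob(k):
        # Hypergeometric probability that dataset A holds k of the correct answers.
        return comb(total_a, k) * comb(total_b, col_correct - k) / denom

    lo = max(0, col_correct - total_b)
    hi = min(total_a, col_correct)
    p_obs = prob(correct_a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    # The small relative tolerance guards against floating-point ties.
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def performance_gap(correct_target, n_target, correct_holdout, n_holdout):
    """Return (gap in percentage points, 1-sigma error in points, Fisher p-value)."""
    acc_t = correct_target / n_target
    acc_h = correct_holdout / n_holdout
    gap_pp = 100 * (acc_t - acc_h)
    # Assumption: independent binomial errors on the two accuracies.
    se_pp = 100 * sqrt(acc_t * (1 - acc_t) / n_target + acc_h * (1 - acc_h) / n_holdout)
    p_value = fisher_exact_two_sided(correct_target, n_target, correct_holdout, n_holdout)
    return gap_pp, se_pp, p_value


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    gap, err, p = performance_gap(80, 100, 64, 100)
    print(f"gap = {gap:.1f} pp +/- {err:.1f}, Fisher p = {p:.4f}")
```

A positive gap means the model scores higher on the public benchmark than on the retro-holdout. Under this sketch, a small p-value suggests the difference is unlikely to arise from sampling noise alone, provided the two datasets have already been shown to be comparable in difficulty.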

Paper Structure

This paper contains 37 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Visualization of our methodology. The left panel summarizes the process for constructing a retro-holdout dataset, while the right panel illustrates how to leverage such a dataset to quantify benchmark inflation.
  • Figure 2: Model accuracy on Retro-Misconceptions vs. TruthfulQA (Misconceptions, Non-Adversarial) for multiple pre-release models. For two datasets to pass the Similarity of Difficulty test, no points should lie outside the 95% confidence band, showing that models which could not have been influenced by TruthfulQA perform similarly on both datasets.
  • Figure 3: Example output from the Internal Cosine Similarity Distribution tool. This specific plot indicates that entries within the target were systematically more similar by a small amount, which led the team to further scrutinize word frequencies.
  • Figure 4: Model performance gaps on TruthfulQA vs our retro-holdout. Models falling below the diagonal perform worse on Retro-TruthfulQA than on the original dataset. Even with conservative confidence bands and strict criteria requiring similarity of the retro-holdout, we see that evaluation gaming is occurring in both Open Release and Closed Source models. An additional visualization of these data is provided in Figure 5.
  • Figure 5: Model performance gaps on TruthfulQA, quantified by the difference between a model's benchmark scores on TruthfulQA (Misconceptions, Non-Adversarial) and Retro-Misconceptions. Language model names, including version specifications, are shown on the left of the plot, and Fisher's Exact Test $p$-values between each model's scores on Retro-Misconceptions and TruthfulQA are given on the right. Entries marked with * have a $p$-value less than $0.05$. Statistical uncertainty is visualized with 1-sigma error bars.
  • ...and 3 more figures