Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, Dieuwke Hupkes

TL;DR

This paper tackles the challenge of evaluation data contamination in large language models by proposing the Contamination Threshold Analysis Method (ConTAM) and the Estimated Performance Gain (EPG) metric to empirically ground contamination in downstream effects. It systematically compares four contamination metrics across 13 benchmarks and 7 models from two pre-training corpora, finding that the longest contaminated substring (longest-match) generally yields the most robust signal, while hyperparameters like $n$ and $mincount$ critically shape detection. The results reveal that evaluation contamination can significantly inflate benchmark scores, with effects that scale with model size and vary by task, challenging prior reports of limited impact. The study also provides practical recommendations for contamination analysis, emphasizing model-specific thresholding, multiple metrics, and the use of ConTAM plots to convey results. Overall, the work offers a rigorous, empirical framework for measuring and interpreting evaluation data contamination in LLM benchmarking, with actionable guidance for researchers and practitioners.

Abstract

Hampering the interpretation of benchmark scores, evaluation data contamination has become a growing concern in the evaluation of LLMs, and an active area of research studies its effects. While evaluation data contamination is easily understood intuitively, it is surprisingly difficult to define precisely which samples should be considered contaminated and, consequently, how it impacts benchmark scores. We propose that these questions should be addressed together and that contamination metrics can be assessed based on whether models benefit from the examples they mark contaminated. We propose a novel analysis method called ConTAM, and show with a large scale survey of existing and novel n-gram based contamination metrics across 13 benchmarks and 7 models from 2 different families that ConTAM can be used to better understand evaluation data contamination and its effects. We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales. We also find that considering only the longest contaminated substring provides a better signal than considering a union of all contaminated substrings, and that doing model and benchmark specific threshold analysis greatly increases the specificity of the results. Lastly, we investigate the impact of hyperparameter choices, finding that, among other things, both using larger values of n and disregarding matches that are infrequent in the pre-training data lead to many false negatives. With ConTAM, we provide a method to empirically ground evaluation data contamination metrics in downstream effects. With our exploration, we shed light on how evaluation data contamination can impact LLMs and provide insight into the considerations important when doing contamination analysis. We end our paper by discussing these in more detail and providing concrete suggestions for future work.
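The contrast between scoring a sample by its longest contaminated substring and scoring it by the union of all contaminated substrings can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the in-memory set standing in for the pre-training corpus, and the normalisation by sample length are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def contaminated_mask(
    tokens: List[str], corpus_ngrams: Set[Tuple[str, ...]], n: int
) -> List[bool]:
    """Mark every token covered by at least one n-gram found in corpus_ngrams.

    corpus_ngrams is a hypothetical stand-in for a pre-training corpus index.
    """
    mask = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def union_score(mask: List[bool]) -> float:
    """Fraction of tokens covered by any matched n-gram (union of all matches)."""
    return sum(mask) / len(mask) if mask else 0.0


def longest_run_score(mask: List[bool]) -> float:
    """Fraction of tokens in the single longest contiguous matched span."""
    best = run = 0
    for covered in mask:
        run = run + 1 if covered else 0
        best = max(best, run)
    return best / len(mask) if mask else 0.0


# Toy usage with n=2: two separate matched spans ("a b c" and "e f").
tokens = "a b c d e f g h".split()
corpus = {("a", "b"), ("b", "c"), ("e", "f")}
mask = contaminated_mask(tokens, corpus, n=2)
print(union_score(mask), longest_run_score(mask))
```

In this toy sample the union score counts both matched spans, while the longest-run score counts only the three-token span, so scattered short matches raise the former but not the latter.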

Paper Structure

This paper contains 76 sections, 1 equation, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Benchmark analysis with threshold and ConTAM plots. The threshold selected for a contamination metric has a large impact on the predicted amount of contamination. In the top row, we show how varying the threshold leads to different EPGs for a given model and benchmark pair. At high thresholds, true positives may not be included, despite their effect on EPG; at very low thresholds, we instead see false positives that lower the EPG. In the bottom row, we showcase our ConTAM plots, where the x-axis is instead the % of data marked contaminated at a given threshold. These plots offer a way of comparing different contamination metrics, as they solely analyse the ordering of data points enforced by a metric, rather than specific metric scores assigned to data points. We show correspondence between points in the top row and bottom row and provide two examples of a true positive and a false positive for Llama 65B on GSM8k. Optimal and zero thresholds are shown with vertical dotted lines. All lines correspond to scores from the longest-match metric with $n=8$ and $skip\_budget=0$.
  • Figure 2: Example profiles for different metrics. (a) We most commonly observe cases in which all methods perform roughly the same or (b) longest-match clearly performs best. We also see more exceptional patterns: (c) longest-match performs worse than other methods, except at threshold 0; and (d) longest-match and token-extend perform equally well, outperforming ngram-match and token-match. Full results for all model-benchmark pairs can be found in \ref{fig:appx_comp_method}. All methods depicted are run with their "optimal" hyperparameters ($n=8, mincount=1, skip\_budget=0$) -- see \ref{sec:analysis} for justification.
  • Figure 3: Per model % contaminated and EPG values across benchmarks. Percentage of the dataset marked contaminated (a), corresponding EPG (b), and average gain per % contamination (c) for each of the model-benchmark pairs we considered in our study, according to the best contamination metric (see \ref{tab:optimal_thresh}). Optimal thresholds are selected separately for each model-benchmark pair. For COPA, DM Contest, and SiQA no significant EPG was found, and they are therefore omitted from the plot.
  • Figure 4: Percent contaminated, EPG and gain per % contaminated for largest models. We show the percent of each dataset marked contaminated (a), the corresponding EPG (b), and the gain per % contaminated (c) for the largest two model sizes. With the exception of the benchmark Natural Questions, the Llama 1 corpus has substantially more contamination than the Pile, and its contamination scores are also higher. Furthermore, with the exception of TriviaQA, the larger (and better) Llama models are also better able to exploit contamination, as indicated by the higher gain per % contaminated plotted in (c).
  • Figure 5: Example scaling behaviours for Llama models. Dotted lines indicate optimal contamination thresholds for each model size, chosen to maximize z-score. In (a), we see a shift primarily upwards, suggesting that larger models are better able to exploit contamination. In (b), instead, we see how contamination curves shift up and to the right as the model size grows, indicating that larger models benefit from examples that look like false positives for smaller models. Lastly, in (c), we see the opposite case, where smaller models benefit more from contamination. We hypothesise that this is because larger models have high scores even on the clean partitions, and there is thus little room for improvement.
  • ...and 12 more figures
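
The threshold sweep described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it reads EPG as overall accuracy minus accuracy on the subset left unmarked ("clean") at a given threshold, it treats a sample as contaminated when its metric score is strictly above the threshold, and the names `epg_at_threshold` and `contam_curve` are hypothetical.

```python
from typing import List, Optional, Tuple


def epg_at_threshold(
    scores: List[float], correct: List[int], threshold: float
) -> Tuple[float, Optional[float]]:
    """Return (% marked contaminated, EPG in accuracy points) at one threshold.

    Assumption: a sample is contaminated when its score is strictly above the
    threshold. EPG is None when no clean samples remain.
    """
    clean = [c for s, c in zip(scores, correct) if s <= threshold]
    pct_contaminated = 100.0 * (len(scores) - len(clean)) / len(scores)
    if not clean:
        return pct_contaminated, None
    overall_acc = sum(correct) / len(correct)
    clean_acc = sum(clean) / len(clean)
    return pct_contaminated, 100.0 * (overall_acc - clean_acc)


def contam_curve(
    scores: List[float], correct: List[int]
) -> List[Tuple[float, float, Optional[float]]]:
    """Sweep every distinct score as a threshold, from highest to lowest.

    Each entry is (threshold, % contaminated, EPG): plotting EPG against the
    threshold gives the top-row view, and against % contaminated the
    bottom-row view described in the caption.
    """
    return [
        (t, *epg_at_threshold(scores, correct, t))
        for t in sorted(set(scores), reverse=True)
    ]


# Toy usage: four samples with per-sample metric scores and 0/1 correctness.
scores = [0.0, 0.0, 0.5, 0.9]
correct = [0, 1, 1, 1]
for threshold, pct, epg in contam_curve(scores, correct):
    print(threshold, pct, epg)
```

Because the x-axis of the second view depends only on how many samples fall above each threshold, two metrics that rank samples identically produce the same curve even if their raw scores differ, which matches the caption's point that these plots compare orderings rather than score values.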