Soft Contamination Means Benchmarks Test Shallow Generalization

Ari Spiesberger, Juan J. Vazquez, Nicky Pochinkov, Tomáš Gavenčiak, Peli Grietzer, Gavin Leech, Nandi Schoots

TL;DR

The paper investigates soft contamination of LLM training data by semantic duplicates and demonstrates that benchmark gains can arise from training data leakage rather than genuine out-of-distribution generalization. Using the open-source Olmo3 model, it shows widespread semantic duplication across major reasoning benchmarks and that finetuning on duplicates can boost benchmark scores, sometimes matching gains expected from true capability improvements. The findings suggest many recent reasoning-benchmark gains are confounded by data contamination, underscoring the need for careful benchmark auditing and ecologically valid evaluation of AI progress. The work also introduces ecologically realistic finetuning experiments to quantify practical contamination effects.

Abstract

If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
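To illustrate why string-level filters can miss semantic duplicates, here is a minimal sketch of an n-gram overlap check. It is not the filter used by any particular training pipeline: the function names, the whitespace tokenization, the trigram size, and the example sentences are all assumptions chosen for illustration.

```python
def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Fraction of shared n-grams, relative to the smaller n-gram set."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


# Hypothetical benchmark item and two hypothetical training samples.
benchmark = "What is the sum of the first ten positive integers?"
verbatim = "What is the sum of the first ten positive integers?"
paraphrase = "Add up every whole number from 1 through 10 and report the total."

verbatim_score = ngram_overlap(benchmark, verbatim)
paraphrase_score = ngram_overlap(benchmark, paraphrase)
print(verbatim_score, paraphrase_score)
```

In this toy case the verbatim copy shares every trigram with the benchmark item, while the paraphrase shares none, so a filter that thresholds on n-gram overlap would flag the first and pass the second even though both pose the same problem.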

Paper Structure (35 sections, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Exact duplicates by puzzle grid size. For each ZebraLogic benchmark datapoint, we check whether any of its 10 highest cosine similarity training datapoints is an exact duplicate. The y-axis shows the proportion of benchmark datapoints (of a given grid size) that have at least one exact duplicate. The x-axis shows puzzle grid size.
  • Figure 2: Relationship between cosine similarity level and semantic duplication. For each benchmark datapoint we sample 100 matches from the top 0.1% cosine similarity matches in the training data. The x-axis shows the cosine similarity. The y-axis shows the percentage of matches at that cosine similarity level that are true semantic duplicates. The opaque band shows the confidence interval, which widens when there are fewer samples at a given cosine similarity level. Red shows semantic duplicates inclusive of exact duplicates; blue shows them exclusive of exact duplicates.
  • Figure 3: Occurrence by Elo. For each benchmark datapoint, we check whether any of its top 100 cosine similarity training datapoints is a semantic duplicate. The y-axis shows the proportion of all benchmark datapoints that have at least one semantic duplicate. The x-axis shows Elo scores.
  • Figure 4: Occurrence by training dataset. For each benchmark datapoint, we check whether any of its top 100 cosine similarity training datapoints is a semantic duplicate. The y-axis shows the proportion of all benchmark datapoints that have at least one semantic duplicate. The x-axis shows the different training datasets. The lines show the standard deviation.
  • Figure 5: Occurrence by number of cosine similarity matches investigated. We count the semantic duplicates found among the top-n matches for each of our dataset comparisons.
  • ...and 5 more figures
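
The statistic plotted in Figures 1, 3, and 4 (the proportion of benchmark datapoints with at least one duplicate among their top-k cosine similarity matches) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy vectors stand in for real embeddings, and `is_duplicate` is a hypothetical placeholder for whatever duplicate judgment is applied to each retrieved match.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query, corpus, k):
    """Indices of the k corpus vectors most similar to `query`."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]


def duplicate_rate(bench_vecs, train_vecs, is_duplicate, k=10):
    """Proportion of benchmark items with >= 1 duplicate among their top-k matches."""
    hits = 0
    for b, query in enumerate(bench_vecs):
        if any(is_duplicate(b, t) for t in top_k(query, train_vecs, k)):
            hits += 1
    return hits / len(bench_vecs)


# Toy embeddings (assumed); real ones would come from an embedding model.
bench_vecs = [[1.0, 0.0], [0.0, 1.0]]
train_vecs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]

# Hypothetical labels: training item 0 duplicates benchmark item 0.
labels = {(0, 0)}
rate = duplicate_rate(bench_vecs, train_vecs,
                      lambda b, t: (b, t) in labels, k=1)
print(rate)
```

Sorting the whole corpus per query is only workable for toy sizes; at the scale of a training corpus an approximate nearest-neighbor index would be the usual substitute for `top_k`.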