Soft Contamination Means Benchmarks Test Shallow Generalization

Ari Spiesberger; Juan J. Vazquez; Nicky Pochinkov; Tomáš Gavenčiak; Peli Grietzer; Gavin Leech; Nandi Schoots

TL;DR

The paper investigates soft contamination of LLM training data by semantic duplicates and demonstrates that benchmark gains can arise from training data leakage rather than genuine out-of-distribution generalization. Using the open-source Olmo3 model, it shows widespread semantic duplication across major reasoning benchmarks and that finetuning on duplicates can boost benchmark scores, sometimes matching gains expected from true capability improvements. The findings suggest many recent reasoning-benchmark gains are confounded by data contamination, underscoring the need for careful benchmark auditing and ecologically valid evaluation of AI progress. The work also introduces ecologically realistic finetuning experiments to quantify practical contamination effects.

Abstract

If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces problems and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
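
To illustrate why n-gram matching can miss a semantic duplicate, here is a minimal sketch. The function names, the window size of 8 words, and the example sentences are assumptions made for illustration; they are not taken from the paper or from any particular decontamination pipeline.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    bench = word_ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & word_ngrams(training_doc, n)) / len(bench)


# Hypothetical benchmark item and a paraphrase with the same content.
benchmark = ("Alice has three apples and gives one apple to Bob. "
             "How many apples does Alice have left?")
paraphrase = ("After handing a single apple to Bob, how many of her "
              "three apples does Alice still keep?")

exact_score = ngram_overlap(benchmark, benchmark)        # verbatim copy
paraphrase_score = ngram_overlap(benchmark, paraphrase)  # semantic duplicate
print(exact_score, paraphrase_score)
```

Under these assumptions the verbatim copy is flagged while the paraphrase shares no 8-word window with the benchmark item, so a filter that thresholds on this score would let it through.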

Paper Structure (35 sections, 10 figures, 9 tables)

Introduction
Related Work
Methodology
Benchmarks
Finding Semantic Duplicates in the Wild
Generating Synthetic Semantic Duplicates
Finetuning on Duplicates
Results
Exact duplicates in training corpora
Natural semantic duplicates in training corpora
Finetuning on Semantic Duplicates
Ecologically Valid Finetuning
Limitations and Future Work
Further Details on Methodology
Olmo3 Instruct Training Datasets
...and 20 more sections

Figures (10)

Figure 1: For each ZebraLogic benchmark datapoint, we check whether any of the 10 training datapoints with the highest cosine similarity is an exact duplicate. The y-axis shows the proportion of benchmark datapoints (of a given grid size) that have at least one exact duplicate; the x-axis shows puzzle grid size.
Figure 2: Relationship between cosine similarity level and semantic duplication. For each benchmark datapoint we sample 100 matches from the top 0.1% cosine similarity matches in the training data. The x-axis shows the cosine similarity; the y-axis shows the percentage of matches at that similarity level that are true semantic duplicates. The opaque graph shows the confidence interval, which widens when there are fewer samples at a given cosine similarity level. Red shows semantic duplicates including exact duplicates; blue shows semantic duplicates excluding exact duplicates.
Figure 3: Occurrence by Elo. For each benchmark datapoint, we check whether any of the 100 training datapoints with the highest cosine similarity is a semantic duplicate. The y-axis shows the proportion of all benchmark datapoints that have at least one semantic duplicate; the x-axis shows Elo scores.
Figure 4: Occurrence by training dataset. For each benchmark datapoint, we check whether any of the 100 training datapoints with the highest cosine similarity is a semantic duplicate. The y-axis shows the proportion of all benchmark datapoints that have at least one semantic duplicate; the x-axis shows the different training datasets. The lines show the standard deviation.
Figure 5: Occurrence by number of cosine similarity matches investigated. We report the number of semantic duplicates found within the top-n matches for each of our dataset comparisons.
...and 5 more figures
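
The statistic described in the captions for Figures 1, 3 and 4 (the proportion of benchmark datapoints with at least one duplicate among their top-k cosine similarity matches) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' pipeline: the function names are hypothetical, and `is_duplicate` stands in for whatever exact-match or semantic-duplicate judgment is applied to each candidate pair. At corpus scale a vector index would replace the brute-force ranking used here.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_indices(query: Vector, corpus: Sequence[Vector], k: int) -> list:
    """Indices of the k corpus vectors most similar to `query` (brute force)."""
    ranked = sorted(range(len(corpus)),
                    key=lambda j: cosine(query, corpus[j]),
                    reverse=True)
    return ranked[:k]


def fraction_with_duplicate(bench: Sequence[Vector],
                            train: Sequence[Vector],
                            is_duplicate: Callable[[int, int], bool],
                            k: int = 100) -> float:
    """Proportion of benchmark items with at least one duplicate in their top-k matches.

    `is_duplicate(i, j)` is a hypothetical judge for benchmark item i and
    training item j; it is an assumption of this sketch.
    """
    if not bench:
        return 0.0
    hits = sum(
        any(is_duplicate(i, j) for j in top_k_indices(query, train, k))
        for i, query in enumerate(bench)
    )
    return hits / len(bench)


# Toy usage with 2-d "embeddings" and a hand-written set of duplicate pairs.
bench_vectors = [[1.0, 0.0], [0.0, 1.0]]
train_vectors = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
known_pairs = {(0, 0), (1, 0)}
judge = lambda i, j: (i, j) in known_pairs

print(fraction_with_duplicate(bench_vectors, train_vectors, judge, k=1))
print(fraction_with_duplicate(bench_vectors, train_vectors, judge, k=2))
```

In the toy data, enlarging k from 1 to 2 lets the second benchmark item reach its duplicate, which shows how the measured proportion depends on how many matches are inspected.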
