Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

Woojin Chung, Jeonghoon Kim

TL;DR

The paper investigates whether zero-shot benchmark performance in large language models primarily reflects word-level overlap between pre-training data and evaluation datasets. The authors define two tokenizer-agnostic metrics, word-level unigram cross-entropy $H(P_{eval},P_{pre})$ and word-frequency statistics, and run large-scale pre-training across multiple corpora, token counts, and model sizes. They find a robust inverse relationship between unigram cross-entropy and benchmark scores: most benchmark performance correlates with distributional alignment rather than true out-of-distribution generalization. Some tasks (grammar, math, multilingual) deviate from this pattern, which marks the limits of overlap-based explanations. The results imply that benchmark design and interpretation should account for word-frequency effects and distributional overlap. The authors also propose methods to identify higher-quality data subsets, while cautioning against over-interpreting benchmark gains as genuine generalization. Overall, the work calls for rethinking what constitutes a diagnostic benchmark in the era of web-scale pre-training and highlights the importance of overlap diagnostics in data selection and evaluation.

Abstract

Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Furthermore, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.
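
As a rough illustration of the overlap metric named above, the sketch below computes a word-level unigram cross-entropy $H(P_{eval},P_{pre}) = -\sum_w P_{eval}(w)\,\log P_{pre}(w)$ from raw text. This is the textbook definition of cross-entropy and is not taken from the paper's implementation. The regex word splitter, the lowercasing, the add-`alpha` smoothing for unseen words, and the function names `word_counts` and `unigram_cross_entropy` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def word_counts(texts):
    """Count lowercase word occurrences across an iterable of strings.

    Splitting on the regex \\w+ is an assumption; the paper's exact
    word segmentation is not specified in this summary.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def unigram_cross_entropy(eval_counts, pre_counts, alpha=1.0):
    """Return H(P_eval, P_pre) in nats.

    P_eval is the empirical word distribution of the evaluation set.
    P_pre is the pre-training word distribution with add-alpha smoothing
    over the joint vocabulary (an assumption, so that words missing from
    the pre-training sample do not produce log(0)).
    """
    eval_total = sum(eval_counts.values())
    if eval_total == 0:
        raise ValueError("evaluation counts are empty")
    vocab = set(eval_counts) | set(pre_counts)
    pre_total = sum(pre_counts.values()) + alpha * len(vocab)
    entropy = 0.0
    for word, count in eval_counts.items():
        p_eval = count / eval_total
        p_pre = (pre_counts.get(word, 0) + alpha) / pre_total
        entropy -= p_eval * math.log(p_pre)
    return entropy


if __name__ == "__main__":
    # Toy strings for illustration only; not data from the paper.
    benchmark = word_counts(["the cat sat"])
    corpus_a = word_counts(["the cat sat on the mat"])
    corpus_b = word_counts(["stock prices fell sharply"])
    print(unigram_cross_entropy(benchmark, corpus_a))
    print(unigram_cross_entropy(benchmark, corpus_b))
```

Under these assumptions, a corpus that shares more vocabulary with the evaluation text receives a lower cross-entropy. Because the metric is computed over words rather than tokens, it does not depend on any particular tokenizer.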

Paper Structure (32 sections, 22 equations, 3 figures, 15 tables)

Figures (3)

  • Figure 1: To evaluate how word-level overlap between the pre-training corpus and downstream benchmarks influences model performance, we first examine the impact of word frequency distribution similarity. This figure shows word-level cross-entropy and benchmark performance across eight downstream tasks for a $3.36\mathrm{B}$ model trained on a $60\mathrm{B}$ token subset. For each benchmark, pre-training datasets (C4, DCLM, and FineWeb-Edu) are compared in terms of unigram cross-entropy (top) and corresponding performance (bottom). Across all tasks, lower cross-entropy consistently corresponds to higher benchmark scores, producing a stable inverse relationship despite differences in absolute performance scales. The relative ordering of pre-training datasets is consistent across benchmarks, with FineWeb-Edu and DCLM typically achieving lower cross-entropy and higher performance than C4. (Full results are reported in Table \ref{tab:3B} in Appendix \ref{apdx:word_overlap_correlation}.)
  • Figure 2: We test whether the negative correlation between benchmark performance and unigram cross-entropy depends on specific subset pairings. This figure plots word-level cross-entropy against benchmark performance for ARC Easy and PIQA across multiple $8.5\mathrm{B}$ token subsets. Each point corresponds to a $400\mathrm{M}$ model trained on a fixed-scale subset of C4, DCLM, FineWeb-Edu and OpenWebText. Within each dataset, both cross-entropy and performance vary only slightly across subsets, while systematic differences across datasets remain, preserving the inverse relationship between cross-entropy and benchmark performance.
  • Figure 3: To test whether multilingual zero-shot performance is driven by the non-monolingual composition of pre-training data, we plot word-level cross-entropy against zero-shot performance on non-parallel multilingual PIQA (accuracy) and translated LAMBADA (perplexity) using a $3.36\mathrm{B}$ model trained on the $60\mathrm{B}$ token subset of DCLM. Across languages, word-level cross-entropy shows no clear or consistent correlation with zero-shot performance, indicating that multilingual generalization is not explained by word-level distributional overlap alone.
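
One way to quantify the inverse relationship described in the figure captions is a rank correlation between per-dataset cross-entropy and benchmark score. The sketch below implements Spearman's rank correlation in plain Python. The summary does not say which correlation statistic the authors use, so Spearman is an assumption here, and the numbers in the usage example are synthetic placeholders, not results from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Synthetic placeholder values, one entry per hypothetical pre-training corpus.
    cross_entropy = [3.0, 2.0, 1.0]
    benchmark_score = [0.1, 0.2, 0.3]
    print(spearman(cross_entropy, benchmark_score))  # -1.0 for a perfectly inverse ordering
```

A value near $-1$ would match the pattern reported for Figures 1 and 2, while a value near $0$ would match the absence of a consistent relationship reported for the multilingual setting in Figure 3.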