Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance
Woojin Chung, Jeonghoon Kim
TL;DR
The paper investigates whether zero-shot benchmark performance in large language models primarily reflects word-level overlap between pre-training data and evaluation datasets. The authors define two tokenizer-agnostic metrics, word-level unigram cross-entropy $H(P_{eval},P_{pre})$ and word-frequency statistics, and run large-scale pre-training experiments across multiple corpora, token counts, and model sizes. They find a robust inverse relationship between unigram cross-entropy and benchmark scores: most benchmark performance tracks distributional alignment rather than true out-of-distribution generalization. Some tasks (grammar, math, multilingual) deviate from this pattern, which marks the limits of overlap-based explanations. The results imply that benchmark design and interpretation should account for word-frequency effects and distributional overlap. The authors also propose methods to identify higher-quality data subsets, and they caution against reading benchmark gains as genuine generalization. Overall, the work calls for rethinking what counts as a diagnostic benchmark in the era of web-scale pre-training and highlights the value of overlap diagnostics in data selection and evaluation.
Abstract
Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Moreover, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.
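To make the overlap measure concrete, the following is a minimal sketch of a word-level unigram cross-entropy, $H(P_{eval},P_{pre}) = -\sum_w P_{eval}(w)\log P_{pre}(w)$, computed from raw word counts. The lowercased whitespace tokenization, the add-`alpha` smoothing for words unseen in the pre-training sample, and the function names `word_counts` and `unigram_cross_entropy` are all assumptions made for illustration; the paper's exact word segmentation and handling of unseen words may differ.

```python
import math
from collections import Counter


def word_counts(texts):
    """Count words over an iterable of strings.

    Assumption: lowercased whitespace splitting stands in for whatever
    word segmentation the paper actually uses.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def unigram_cross_entropy(eval_counts, pre_counts, alpha=1.0):
    """Return H(P_eval, P_pre) in nats from two word-count tables.

    Assumption: add-alpha smoothing over the joint vocabulary keeps
    log P_pre(w) finite for evaluation words missing from the
    pre-training sample. With alpha=0 every evaluation word must
    appear in pre_counts.
    """
    vocab = set(eval_counts) | set(pre_counts)
    eval_total = sum(eval_counts.values())
    pre_total = sum(pre_counts.values()) + alpha * len(vocab)

    cross_entropy = 0.0
    for word, count in eval_counts.items():
        p_eval = count / eval_total
        p_pre = (pre_counts.get(word, 0) + alpha) / pre_total
        cross_entropy -= p_eval * math.log(p_pre)
    return cross_entropy


# Toy comparison: one evaluation text against two tiny "corpora".
eval_c = word_counts(["the cat sat on the mat"])
near_c = word_counts(["the cat sat on the mat and the dog"])
far_c = word_counts(["quantum flux capacitor voltage"])

h_near = unigram_cross_entropy(eval_c, near_c)
h_far = unigram_cross_entropy(eval_c, far_c)
print(f"near corpus: {h_near:.3f} nats, far corpus: {h_far:.3f} nats")
```

In this toy setup the corpus that shares vocabulary with the evaluation text gets the lower cross-entropy, which is the direction the abstract associates with higher benchmark scores. The smoothing constant affects absolute values, so comparisons are only meaningful when the same `alpha` and tokenization are used for every corpus.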
