Mapping Overlaps in Benchmarks through Perplexity in the Wild

Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans

TL;DR

This work addresses two problems with large language model benchmarks: they overlap substantially, and it is hard to interpret which underlying abilities they measure. It introduces benchmark signatures, defined as subsets of salient tokens from in-the-wild corpora whose token-level perplexities across models strongly predict benchmark performance, thereby linking training exposure to observed competence. The authors implement a two-stage mining pipeline (token-level screening via Thrush and Pre-select correlations, followed by forward selection with AIC) to extract compact, predictive signatures from billions of tokens, and validate them on 32 models across 88 benchmarks. Results show that signature overlaps discriminate between benchmarks far more strongly than semantic similarity or performance correlations, while remaining robust to model-family and question-format biases. Cross-functional overlaps reveal interconnected capacity across logic, math, language, instruction following, and world modeling, with coding appearing more isolated. Overall, the signature framework offers mechanistic insights into LLM capabilities and benchmark validity and suggests new directions for benchmark design and analysis, including potential extensions to layer-wise or algebraic analyses of benchmark structures.

Abstract

We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
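The stepwise forward selection with linear regressions mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ols_aic` and `forward_select`, the synthetic data, and the Gaussian-error form of AIC (up to an additive constant) are all assumptions made for the example. Each column of `X` stands in for one candidate token's perplexity across models, and `y` for the models' scores on one benchmark.

```python
import numpy as np


def ols_aic(X, y):
    """AIC of an intercept + OLS fit, assuming Gaussian errors (up to a constant)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the noise variance
    return n * np.log(max(rss, 1e-12) / n) + 2 * k


def forward_select(X, y, max_features=None):
    """Greedily add the column that lowers AIC most; stop when no column helps."""
    n, p = X.shape
    if max_features is None:
        max_features = min(p, n - 2)  # keep the fit from becoming saturated
    selected = []
    best_aic = ols_aic(X[:, []], y)  # intercept-only baseline
    while len(selected) < max_features:
        candidates = [j for j in range(p) if j not in selected]
        if not candidates:
            break
        aic, j = min((ols_aic(X[:, selected + [j]], y), j) for j in candidates)
        if aic >= best_aic:
            break  # no candidate improves the criterion
        selected.append(j)
        best_aic = aic
    return selected, best_aic


# Hypothetical data: 32 "models", 20 candidate "tokens"; only columns 3 and 7 matter.
rng = np.random.default_rng(0)
n_models, n_tokens = 32, 20
X = rng.normal(size=(n_models, n_tokens))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.05 * rng.normal(size=n_models)

selected, best_aic = forward_select(X, y)
print("selected columns:", selected)
```

In this toy setting the two informative columns are recovered, possibly alongside a few spurious ones, since AIC penalizes extra features only mildly when the number of models is small.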

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, 4 tables, and 4 algorithms.

Figures (5)

  • Figure 1: Left: Signature-based correlations across benchmark functions. The Spearman correlation is on average 0.285 for benchmarks within the same design function and 0.087 for cross-function overlaps. Right: Performance-level correlations grouped by benchmark families (MMLU vs. BBH) and question formats (Multi-Choice Questions vs. True-or-False). Mainstream performance-based benchmark agreements are biased towards these benchmark-orthogonal factors (red areas) rather than actual design functions. $\rho_s$ represents the similarity range in the right panel.
  • Figure 2: Overview of the rationale for how in-the-wild corpora implicitly encode the benchmark signature, knowledge exposure, and benchmark performance.
  • Figure 3: Distribution of Thrush correlations in pre-selection phases; red vertical lines mark the 1st and 99th percentiles, highlighting that few features are highly correlated with performance.
  • Figure 4: Three levels of benchmark relation analysis. The signature-level analysis demonstrates substantially stronger discriminative ability compared to both semantic- and performance-level analyses. All heatmaps are presented using a consistent color range from -1 to 1, and panels b and c share the same row and column indices articulated in panel a.
  • Figure 5: Biases (within/between families; same/different formats) are well addressed by the signature.

Theorems & Definitions (2)

  • Definition 3.1: Thrush Correlation [thrush2024improving]
  • Definition 3.2: Pre-select Correlation [shum2025predictive]
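
The definitions themselves are not reproduced in this outline, so the sketch below does not implement the Thrush or Pre-select correlation. As a loudly labeled stand-in, it uses a plain Spearman rank correlation between each token's log-perplexity across models and the models' benchmark scores, then keeps the tokens in the extreme tails. The function names, the synthetic data, and the use of the 1st and 99th percentiles as cutoffs (borrowed from the lines marked in Figure 3) are assumptions for illustration only.

```python
import numpy as np


def rank_columns(a):
    """Ordinal ranks per column; ties are broken by position (fine for continuous data)."""
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float)


def spearman_screen(log_ppl, scores, lower_pct=1.0, upper_pct=99.0):
    """Spearman correlation of each token column with scores; keep the tail tokens.

    log_ppl: array of shape (n_models, n_tokens)
    scores:  array of shape (n_models,)
    """
    r_x = rank_columns(log_ppl)
    r_y = rank_columns(scores[:, None])[:, 0]
    r_x -= r_x.mean(axis=0)
    r_y -= r_y.mean()
    denom = np.sqrt((r_x ** 2).sum(axis=0) * (r_y ** 2).sum())
    rho = (r_x * r_y[:, None]).sum(axis=0) / denom
    lo, hi = np.percentile(rho, [lower_pct, upper_pct])
    keep = np.flatnonzero((rho <= lo) | (rho >= hi))
    return rho, keep


# Hypothetical data: 32 "models", 1000 "tokens"; token 5 tracks performance closely
# (lower perplexity goes with a higher score, so its correlation is strongly negative).
rng = np.random.default_rng(0)
scores = rng.normal(size=32)
log_ppl = rng.normal(size=(32, 1000))
log_ppl[:, 5] = -scores + 0.1 * rng.normal(size=32)

rho, keep = spearman_screen(log_ppl, scores)
print("tokens kept:", len(keep), "| token 5 kept:", 5 in keep)
```

A screening step of this kind only narrows the candidate pool; the surviving tokens would then be passed to a selection procedure such as forward selection with AIC.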