Pitfalls of Evaluating Language Models with Open Benchmarks

Md. Najib Hasan, Md Mahadi Hassan Sibat, Mohammad Fakhruddin Babar, Souvika Sarkar, Monowar Hasan, Santu Karmaker

TL;DR

Open, holistic benchmarks enable transparent LM evaluation but introduce data leakage risks that can be exploited to inflate leaderboard rankings. The authors construct cheating models from BART, T5, GPT-2, and mid-sized LLMs and show memorization-driven gains on public test-sets under two setups: 1/n (train on one scenario, test on all scenarios) and n/1 (train on all scenarios but one, which is held out). These models generalize poorly to unseen data. The authors also explore paraphrase-based safeguards and find only partial mitigation: semantic reformulations reduce surface-level leakage but become ineffective once the paraphrase strategy is known. The work argues for dynamic, private, or hybrid evaluation frameworks to preserve leaderboard integrity and to ensure that benchmark progress reflects real generalization rather than memorization.

Abstract

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet this openness also creates substantial risks of data leakage during LM testing, whether deliberate or inadvertent, thereby undermining the fairness and reliability of leaderboard rankings and leaving them vulnerable to manipulation by unscrupulous actors. We illustrate the severity of this issue by intentionally constructing cheating models: smaller variants of BART, T5, and GPT-2, fine-tuned directly on publicly available test-sets. As expected, these models excel on the target benchmarks but fail to generalize to comparable unseen test-sets. We then examine simple, task-specific paraphrase-based safeguarding strategies to mitigate the impact of data leakage and evaluate their effectiveness and limitations. Our findings underscore three key points: (i) high leaderboard performance on limited open, static benchmarks may not reflect real-world utility; (ii) private or dynamically generated benchmarks should complement open benchmarks to maintain evaluation integrity; and (iii) a reexamination of current benchmarking practices is essential for reliable and trustworthy LM assessment.

Paper Structure

This paper contains 27 sections, 14 figures, 34 tables.

Figures (14)

  • Figure 1: Public Leaderboard Gaming in open-evaluation benchmarks and effects of safeguard strategy.
  • Figure 2: Performance heatmaps for the SLMs BART-base (3/3) and T5-small (6/6) and the LLMs Qwen-3 (0.6B) and LLaMA 3.2 (1B) under two evaluation strategies. Subfigures (a), (c), (e), and (g) show the 1/n setup, where the model is trained on one HELM-lite scenario (Y-axis) and tested on all 10 scenarios (X-axis). Subfigures (b), (d), (f), and (h) show the n/1 setup, where the model is trained on nine scenarios (Y-axis) and tested on one held-out scenario (X-axis); for example, n-NQ_Openbook means the model is trained on the nine scenarios other than NQ_Openbook and tested on each of the ten scenarios separately.
  • Figure 3: Controlled paraphrasing strategies for both open-ended and multiple-choice tasks of HELM-lite.
  • Figure 4: Performance evaluation of scratch-trained SLMs (BART-base) and fine-tuned LLMs (LLaMA 3.2 (1B)) using fuzzy-matching accuracy across two scenario types. Subfigure (a) compares fuzzy-matching accuracy (%) across seven open-ended scenarios under two evaluation setups: (1) training and testing on the original test-set (Orig→Orig), and (2) training on the original test-set and evaluating on paraphrased questions (Orig→PQ). Subfigure (b) compares fuzzy-matching accuracy (%) for multiple-choice scenarios (MMLU and MedQA) under four setups: (1) evaluation on paraphrased questions (Para'd Q); (2) evaluation on paraphrased options (Para'd Optn); (3) evaluation on paraphrased questions + options (Para'd Q + Optn); and (4) evaluation on paraphrased questions + shuffled options (Para'd Q + Optn. shuffled).
  • Figure 5: Performance evaluation of scratch-trained SLMs (BART-base) using fuzzy-matching accuracy across two scenario types after training on ten paraphrased versions (PQx10), i.e., when cheating models gain knowledge of the paraphrasing strategy. Subfigure (a) shows the fuzzy-matching accuracy (%) of BART-base for seven open-ended scenarios. Subfigure (b) shows four MCQ setups: (1) MCQ setup-1: train on 10 paraphrased question versions, test on an unseen version; (2) MCQ setup-2: train on 10 paraphrased option versions, test on an unseen version; (3) MCQ setup-3: train on 10 paraphrased questions + options, test on an unseen version; and (4) MCQ setup-4: same as setup-3 but with option positions shuffled.
  • ...and 9 more figures
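
The captions above describe two train/test protocols (1/n and n/1) and a fuzzy-matching accuracy metric. The sketch below illustrates how these could be expressed in Python. It is not the authors' implementation. The summary does not say how fuzzy matching is computed, so the use of `difflib.SequenceMatcher` and the 0.8 threshold are assumptions. Only NQ_Openbook, MMLU, and MedQA are named in the captions; the remaining scenario identifiers are placeholders, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Only the first three names appear in the captions; the rest are placeholders.
SCENARIOS = ["NQ_Openbook", "MMLU", "MedQA"] + [f"scenario_{i}" for i in range(4, 11)]


def one_to_n_splits(scenarios):
    """1/n setup: train on a single scenario, test on every scenario."""
    return [([s], list(scenarios)) for s in scenarios]


def n_to_one_splits(scenarios):
    """n/1 setup: hold out one scenario, train on all of the others.

    Returns (held_out, training_scenarios) pairs.
    """
    return [(held_out, [s for s in scenarios if s != held_out])
            for held_out in scenarios]


def _normalize(text):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def fuzzy_match(prediction, reference, threshold=0.8):
    """Assumed criterion: similarity ratio at or above the threshold."""
    ratio = SequenceMatcher(None, _normalize(prediction), _normalize(reference)).ratio()
    return ratio >= threshold


def fuzzy_accuracy(predictions, references, threshold=0.8):
    """Percentage of predictions that fuzzily match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(fuzzy_match(p, r, threshold) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

With ten scenarios, `one_to_n_splits` yields ten training runs of one scenario each, and `n_to_one_splits` yields ten runs of nine scenarios each, matching the row counts of the heatmaps in Figure 2. The threshold controls how much surface variation is tolerated, which matters when comparing original and paraphrased test items.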