Pitfalls of Evaluating Language Models with Open Benchmarks

Md. Najib Hasan; Md Mahadi Hassan Sibat; Mohammad Fakhruddin Babar; Souvika Sarkar; Monowar Hasan; Santu Karmaker

TL;DR

Open, holistic benchmarks enable transparent LM evaluation but introduce data leakage risks that can be exploited to inflate leaderboard rankings. The authors construct cheating models based on BART, T5, GPT-2, and mid-sized LLMs, and show that these models achieve memorization-driven gains on public test sets under 1/n and n/1 setups while generalizing poorly to unseen data. They also explore paraphrase-based safeguards and find that these mitigate the problem only partially: semantic reformulations reduce surface-level leakage but become ineffective once the paraphrase strategy is known. The work argues for dynamic, private, or hybrid evaluation frameworks that preserve leaderboard integrity and ensure that benchmark progress reflects real generalization rather than memorization.

Abstract

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet this openness also creates a substantial risk of data leakage during LM testing, whether deliberate or inadvertent, which undermines the fairness and reliability of leaderboard rankings and leaves them vulnerable to manipulation by unscrupulous actors. We illustrate the severity of this issue by intentionally constructing cheating models: smaller variants of BART, T5, and GPT-2, fine-tuned directly on publicly available test sets. As expected, these models excel on the target benchmarks but fail badly to generalize to comparable unseen test sets. We then examine simple, task-specific, paraphrase-based safeguarding strategies for mitigating the impact of data leakage and evaluate their effectiveness and limitations. Our findings underscore three key points: (i) high leaderboard performance on limited, open, static benchmarks may not reflect real-world utility; (ii) private or dynamically generated benchmarks should complement open benchmarks to maintain evaluation integrity; and (iii) a reexamination of current benchmarking practices is essential for reliable and trustworthy LM assessment.
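The leakage effect described in the abstract can be illustrated with a toy sketch. This is not the authors' setup: a lookup table stands in for a model fine-tuned on a public test set, the three question-answer pairs are invented for illustration, and every name here (`build_memorizer`, `accuracy`, `public_test`, `paraphrased_test`) is hypothetical. It shows only the qualitative pattern: a memorizer scores perfectly on the leaked set, collapses on paraphrased items, and recovers once the paraphrases themselves are available to it.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def accuracy(predict: Callable[[str], str], data: List[Example]) -> float:
    """Exact-match accuracy of `predict` over (question, answer) pairs."""
    if not data:
        raise ValueError("data must not be empty")
    correct = sum(predict(question) == answer for question, answer in data)
    return correct / len(data)


def build_memorizer(leaked: List[Example]) -> Callable[[str], str]:
    """Toy stand-in for a model fine-tuned on leaked test items.

    It only recalls answers to questions it has seen verbatim and
    returns an empty string for anything else.
    """
    table = dict(leaked)

    def predict(question: str) -> str:
        return table.get(question, "")

    return predict


# Invented data: a "public" test set and paraphrases with the same answers.
public_test: List[Example] = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]
paraphrased_test: List[Example] = [
    ("How much is two plus two?", "4"),
    ("Which city is France's capital?", "Paris"),
    ("What is the antonym of hot?", "cold"),
]

# Cheater that saw only the public test set.
cheater = build_memorizer(public_test)
public_acc = accuracy(cheater, public_test)
unseen_acc = accuracy(cheater, paraphrased_test)
gap = public_acc - unseen_acc  # large gap signals memorization

# Cheater that also knows the paraphrase strategy (and so its outputs).
informed = build_memorizer(public_test + paraphrased_test)
informed_acc = accuracy(informed, paraphrased_test)

print(f"public={public_acc:.2f} paraphrased={unseen_acc:.2f} gap={gap:.2f}")
print(f"informed cheater on paraphrased={informed_acc:.2f}")
```

A real fine-tuned model would degrade less abruptly than a lookup table, so the gap between the public and paraphrased scores should be read as a diagnostic signal rather than a binary test.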
