Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi
TL;DR
Benchmark Data Contamination (BDC), the leakage of benchmark information into training data, undermines the reliability of LLM evaluations. The paper surveys the problem comprehensively: it splits detection methods into matching-based and comparison-based techniques, and mitigation strategies into data curation, data refactoring, and benchmark-free evaluation, highlighting tools such as TS-Guessing, EvoEval, TreeEval, and FreeEval. It discusses practical challenges such as data access, computational cost, and semantic contamination, and suggests future directions including dynamic benchmarks, human-in-the-loop evaluation, and content-tagging for benchmarks. The work argues for a multi-faceted evaluation framework to preserve trustworthy assessments as LLMs and their training data continue to grow in scale and complexity.
Abstract
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently absorb evaluation benchmark information from their training data, which makes the performance measured during evaluation inaccurate or unreliable. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.
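
To make the idea of contamination concrete, the following is a minimal sketch of one common style of matching-based check: measuring how many word n-grams of a benchmark item also appear verbatim in a training document. This is an illustration under stated assumptions, not the survey's own method. The function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n=3` are all hypothetical choices made for this example.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the training document.

    A value near 1.0 suggests the item appears almost verbatim in the document;
    a value near 0.0 suggests no exact overlap. The threshold for flagging an
    item is left to the user and is not specified here.
    """
    bench = ngrams(benchmark_item, n)
    if not bench:
        # Benchmark item is shorter than n tokens, so there is nothing to match.
        return 0.0
    train = ngrams(training_doc, n)
    return len(bench & train) / len(bench)


if __name__ == "__main__":
    item = "What is the capital of France"
    leaked = "quiz: what is the capital of france answer paris"
    clean = "Large language models are trained on web text"
    print(overlap_ratio(item, leaked))
    print(overlap_ratio(item, clean))
```

An exact-match check of this kind only catches verbatim leakage. It would miss paraphrased or translated benchmark content, which is the semantic contamination challenge the summary above mentions.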
