The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination

Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang

TL;DR

The paper tackles Benchmark Data Contamination by introducing a rigorous, question-level evaluation framework built around two metrics—fidelity and contamination resistance—to assess mitigation strategies that update existing benchmarks. It designs a controlled pipeline, validates uncontaminated LLM-benchmark pairs, and subjects 20 mitigation strategies to two contamination scenarios across 10 LLMs and 5 benchmarks, deriving evaluation vectors to quantify how updates preserve original semantics and resist memorization. The key finding is that no current strategy consistently outperforms the vanilla approach across all benchmarks, and there is a clear trade-off: minor updates tend to maintain fidelity but offer limited resistance, while more substantial or semantic-altering updates raise resistance at the cost of fidelity. The work provides a practical, replicable framework and code to enable robust evaluation of BDC mitigation methods, emphasizing the need for novel strategies that achieve high fidelity and resistance simultaneously and informing practitioners about the limitations of existing approaches.

Abstract

Benchmark Data Contamination (BDC), the inclusion of benchmark testing samples in the training set, has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.
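
To make the contrast between aggregate accuracy and question-level matching concrete, the following minimal sketch compares two hypothetical five-question evaluation vectors. The data and the helper names `accuracy` and `match_rate` are illustrative assumptions and are not taken from the paper's code repository; the match rate here is simply taken to be the fraction of questions on which two runs agree.

```python
from typing import Sequence


def accuracy(results: Sequence[int]) -> float:
    """Fraction of questions answered correctly (1 = correct, 0 = incorrect)."""
    return sum(results) / len(results)


def match_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of questions on which two evaluation vectors agree."""
    if len(a) != len(b):
        raise ValueError("evaluation vectors must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical per-question results for the same five questions.
clean_run = [1, 1, 0, 0, 0]      # questions 1 and 2 answered correctly
mitigated_run = [0, 0, 0, 1, 1]  # questions 4 and 5 answered correctly

# Aggregate accuracy is identical for the two runs...
print(accuracy(clean_run), accuracy(mitigated_run))
# ...while the runs agree on only one of the five questions.
print(match_rate(clean_run, mitigated_run))
```

Under these assumed inputs, an accuracy-matching check would pass even though the two runs disagree on four of the five questions, which is the kind of discrepancy a question-level comparison is meant to expose.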

Paper Structure

This paper contains 29 sections, 6 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Illustration of BDC mitigation strategies. BDC mitigation strategies, such as synonym replacement and analysis extension (Ying et al., 2024), update benchmark questions to reduce the risk of direct memorization.
  • Figure 2: The limitations of existing approaches for assessing BDC mitigation strategies: (a) Accuracy drop measures the performance decline between contaminated accuracy and mitigated accuracy, but does not account for the clean accuracy, making it unclear how much drop indicates effective mitigation. (b) Accuracy matching requires that the mitigated accuracy restores clean accuracy. However, as shown in the example, even when the accuracies match, the question-level evaluation results differ significantly (e.g., correctly answering the 1st and 2nd questions versus the 4th and 5th). This discrepancy suggests that the updated benchmark may evaluate different aspects of model capacity compared to the original benchmark. As a result, the mitigation strategy may fail to preserve the original benchmark’s evaluation objective and could be ineffective.
  • Figure 3: Overview of our pipeline for assessing BDC mitigation strategies: (1) We select an LLM-benchmark pair and ensure it passes three BDC detection methods to confirm it is uncontaminated, a crucial step for reliable "clean" evaluation results. (2) Each mitigation strategy is applied separately to the original benchmark to produce an updated benchmark; 20 strategies are examined in total. (3) The uncontaminated LLM is fine-tuned on the original benchmark dataset. Two contamination recipes (mild and intensive) are tested to ensure robust conclusions, and three validation checks are performed to confirm the effectiveness of the contamination process. (4) Evaluation vectors are computed for: (a) uncontaminated LLM with the original benchmark, (b) uncontaminated LLM with the updated benchmark, and (c) contaminated LLM with the updated benchmark. (5) Fidelity and resistance are derived based on the degree of matching between these evaluation vectors. An effective mitigation strategy should achieve high scores in both metrics.
  • Figure 4: Fidelity-resistance scores across different BDC mitigation strategies under (a) mild and (b) intensive contamination. Single strategies are shown in blue, combined strategies in yellow, and the vanilla case in red. An ideal strategy should lie in the upper-right, but no existing approach achieves this balance. For visual clarity, a few strategies that overlap closely with others are omitted.
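
Steps (4) and (5) of the pipeline in Figure 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each evaluation vector is a list of 0/1 per-question outcomes, that the degree of matching is one minus the normalized Hamming distance, that fidelity compares vectors (a) and (b), and that resistance compares vectors (b) and (c). The names `EvaluationVectors`, `agreement`, and `assess_strategy` and all numbers are hypothetical; the paper's method section gives the exact definitions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EvaluationVectors:
    """Per-question outcomes (1 = correct, 0 = incorrect); hypothetical container."""
    clean_original: List[int]        # (a) uncontaminated LLM, original benchmark
    clean_updated: List[int]         # (b) uncontaminated LLM, updated benchmark
    contaminated_updated: List[int]  # (c) contaminated LLM, updated benchmark


def agreement(u: List[int], v: List[int]) -> float:
    """One minus the normalized Hamming distance (assumed matching measure)."""
    if len(u) != len(v):
        raise ValueError("evaluation vectors must have the same length")
    if not u:
        raise ValueError("evaluation vectors must be non-empty")
    mismatches = sum(1 for x, y in zip(u, v) if x != y)
    return 1 - mismatches / len(u)


def assess_strategy(vectors: EvaluationVectors) -> Tuple[float, float]:
    """Return (fidelity, resistance) under the assumptions stated above."""
    # Fidelity: does the update leave a clean model's per-question results unchanged?
    fidelity = agreement(vectors.clean_original, vectors.clean_updated)
    # Resistance: does contamination leave results on the updated benchmark unchanged?
    resistance = agreement(vectors.clean_updated, vectors.contaminated_updated)
    return fidelity, resistance


# Made-up outcomes for an eight-question benchmark.
example = EvaluationVectors(
    clean_original=[1, 0, 1, 1, 0, 0, 1, 0],
    clean_updated=[1, 0, 1, 0, 0, 0, 1, 0],
    contaminated_updated=[1, 1, 1, 1, 0, 1, 1, 0],
)
fidelity, resistance = assess_strategy(example)
print(f"fidelity={fidelity:.3f}, resistance={resistance:.3f}")
```

In this made-up case the update changes one of eight clean outcomes, while contamination flips three of eight outcomes on the updated benchmark, so the strategy would sit toward high fidelity and lower resistance in a plot like Figure 4.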