CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, Hongyuan Lu

TL;DR

Data contamination can inflate LLM benchmark scores and obscure a model's true capabilities. Clean-Eval uses a three-stage pipeline (paraphrasing, back-translation, semantic filtering) to generate benchmark variants that keep the original meaning but differ in surface form, then selects the best candidate using BLEURT. The method is validated on 20 datasets in both in-context learning and fine-tuning settings, where it brings the scores of contaminated models closer to clean baselines and reduces contamination-driven inflation. This makes the evaluation of rapidly evolving LLMs more reliable and transparent, with practical implications for benchmark design and model comparison.

Abstract

We are currently in an era of fierce competition among large language models (LLMs), which continuously push the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging and critical issue because of potential data contamination, and researchers and engineers waste a great deal of time and effort downloading and trying contaminated models. To save this time, we propose a novel and useful method, Clean-Eval, which mitigates data contamination and evaluates LLMs in a cleaner manner. Clean-Eval employs an LLM to paraphrase and back-translate the contaminated data into a candidate set, generating expressions with the same meaning but different surface forms. A semantic detector then filters out low-quality generated samples to narrow down this candidate set. Finally, the best candidate is selected from this set based on its BLEURT score. According to human assessment, this best candidate is semantically similar to the original contaminated data but expressed differently. All candidates can form a new benchmark to evaluate the model. Our experiments show that Clean-Eval substantially restores the actual evaluation results of contaminated LLMs under both few-shot learning and fine-tuning scenarios.
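
The three stages described in the abstract (candidate generation, semantic filtering, score-based selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `clean_eval_candidate`, its parameters, and all `toy_*` helpers are hypothetical names introduced here. The rewriting functions stand in for LLM paraphrasing and back-translation, the detector stands in for the semantic detector, and a simple token-overlap score stands in for BLEURT. The abstract does not say whether a higher or lower score is preferred, so the sketch assumes the lowest score (the candidate whose surface form differs most from the original) and exposes that choice as a parameter.

```python
from typing import Callable, Iterable, List, Optional


def clean_eval_candidate(
    original: str,
    generators: Iterable[Callable[[str], str]],
    is_equivalent: Callable[[str, str], bool],
    score: Callable[[str, str], float],
    prefer_lowest: bool = True,  # assumption: selection direction is not stated in the abstract
) -> Optional[str]:
    """Return one rewritten version of `original`, or None if no candidate survives."""
    # Stage 1: build the candidate set from every rewriting function
    # (stand-ins for LLM paraphrasing and back-translation).
    candidates: List[str] = []
    for generate in generators:
        text = generate(original).strip()
        if text and text != original and text not in candidates:
            candidates.append(text)

    # Stage 2: the semantic detector drops candidates whose meaning drifted.
    kept = [c for c in candidates if is_equivalent(original, c)]
    if not kept:
        return None

    # Stage 3: rank the survivors with a similarity score and pick one.
    chooser = min if prefer_lowest else max
    return chooser(kept, key=lambda c: score(original, c))


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_paraphrase(text: str) -> str:
    return "Watching this movie was a real pleasure."


def toy_back_translate(text: str) -> str:
    return "The film was a joy to watch."


def toy_drift(text: str) -> str:
    return "The film was boring to watch."  # meaning changed; should be filtered


def toy_detector(original: str, candidate: str) -> bool:
    return "boring" not in candidate


def toy_overlap(original: str, candidate: str) -> float:
    # Jaccard overlap of lowercase tokens, used here in place of BLEURT.
    a, b = set(original.lower().split()), set(candidate.lower().split())
    return len(a & b) / len(a | b)


original = "The film was a delight to watch."
generators = [toy_paraphrase, toy_back_translate, toy_drift]
best = clean_eval_candidate(original, generators, toy_detector, toy_overlap)
print(best)
```

In a real setting, the toy helpers would be replaced by calls to an LLM, a translation system, a semantic detector, and a BLEURT scorer; the control flow is the only part this sketch is meant to show.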

Paper Structure (50 sections, 6 figures, 9 tables)

This paper contains 50 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Data contamination happens when Benchmark A is included in the pretraining data, leading to inflated performance metrics such as top leaderboard rankings. This can cause a clean model to lag behind the contaminated one. We aim to revise Benchmark A, preserving its meaning but changing its surface form, so that re-evaluating the contaminated model brings its measured performance closer to that of a clean model.
  • Figure 2: An overview of our method. We first gather existing benchmarks for LLM assessment and then meticulously clean contamination in these benchmarks through LLM-powered paraphrase and multi-language back-translation, employing a semantic detector to filter and select optimal results based on BLEURT scores.
  • Figure 3: Evaluation setting of in-context learning. Each input comprises a demonstration and a tested sample. In the contamination setting, the demonstration matches the tested sample. In contrast, in the absence of contamination, the demonstration is drawn from another dataset split, maintaining distinction from the tested sample (e.g., sampled from the train split). In our Clean-Eval setup, the tested sample is a calibrated version of the demonstration, specifically designed to mitigate the effects of contamination.
  • Figure 4: Evaluation setting of fine-tuning. We fine-tuned two models using datasets labeled red and green. When evaluated on the red dataset, these two models are categorized as contaminated and uncontaminated, respectively. Evaluating a model on the red dataset after it has been processed by Clean-Eval constitutes the Clean-Eval setting.
  • Figure 5: A case study from the SST2 dataset.
  • ...and 1 more figures
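
The three in-context learning settings in the Figure 3 caption differ only in which demonstration and which tested sample are placed in the prompt. The following Python sketch makes that concrete. The `Example` class, the `build_prompt` and `icl_prompts` functions, and the "Review/Sentiment" prompt template are all hypothetical choices made for illustration; the paper's actual prompt format is not given in this section.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Example:
    text: str
    label: str


def build_prompt(demo: Example, test_text: str) -> str:
    # One labeled demonstration followed by the sample to be classified.
    # The template itself is an assumption, not the paper's format.
    return (
        f"Review: {demo.text}\nSentiment: {demo.label}\n\n"
        f"Review: {test_text}\nSentiment:"
    )


def icl_prompts(test: Example, train_demo: Example, calibrated_text: str) -> Dict[str, str]:
    return {
        # Contaminated: the demonstration is the tested sample itself.
        "contaminated": build_prompt(test, test.text),
        # Clean: the demonstration comes from another split (e.g. train).
        "clean": build_prompt(train_demo, test.text),
        # Clean-Eval: the demonstration is still the original sample, but the
        # tested sample is its calibrated (rewritten) version.
        "clean_eval": build_prompt(test, calibrated_text),
    }


test = Example("The film was a delight to watch.", "positive")
train_demo = Example("A dull, lifeless script.", "negative")
prompts = icl_prompts(test, train_demo, "Watching this movie was a real pleasure.")
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Comparing model accuracy across the three prompt types is what lets the contaminated score be set against the clean baseline and the Clean-Eval score.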