CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, Ruifeng Xu

TL;DR

This paper tackles data contamination in LLM evaluation by proposing CoreEval, an automatic, contamination-resilient framework that injects real-world knowledge from GDELT into existing datasets. It combines entity extraction, knowledge retrieval, and knowledge recontextualization, followed by a data reflection stage that checks label consistency and semantic coherence, thereby mitigating performance overestimation. The authors formalize a workflow inspired by cognitive learning with three components (Real-World Knowledge Attainment, Knowledge Recontextualization, and Data Reflection) and validate the approach across five NLP tasks and multiple LLMs, reporting improved resistance to contamination as measured by the metrics $\delta_1$ and $\delta_2$. Experiments show that CoreEval reduces contamination-driven gains, preserves dataset richness, and yields more stable evaluation as contamination proportions and types vary, enabling fairer, more timely LLM benchmarking. The work contributes a general method for contamination-resilient evaluation that requires little human effort, with practical implications for benchmarks, model development, and AI safety.

Abstract

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
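
The stages named in the abstract (entity-relationship extraction, knowledge retrieval, recontextualization, and data reflection) can be pictured as a small pipeline. The following is a minimal sketch only, not the authors' implementation. Every name in it (`Sample`, `update_sample`, the `toy_*` helpers, `max_rounds`) is hypothetical. The LLM and GDELT calls are replaced by plain callables, so nothing here reflects a real API. The reflection step is simplified to "retry the rewrite until a label check passes", and falling back to the original sample is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical representation of an entity relationship: (head, relation, tail).
Triple = Tuple[str, str, str]


@dataclass
class Sample:
    text: str
    label: str


def update_sample(
    sample: Sample,
    extract: Callable[[str], List[Triple]],
    retrieve: Callable[[List[Triple]], List[str]],
    recontextualize: Callable[[Sample, List[str]], str],
    verify_label: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Sample:
    """Return an updated sample, or the original one if no rewrite passes the label check."""
    triples = extract(sample.text)      # stage 1a: entity relationships
    knowledge = retrieve(triples)       # stage 1b: external, up-to-date knowledge
    if not knowledge:
        return sample                   # assumption: keep the original when nothing is retrieved
    for _ in range(max_rounds):         # stage 3: simplified reflection loop
        candidate = recontextualize(sample, knowledge)  # stage 2: merge knowledge and data
        if verify_label(candidate, sample.label):
            return Sample(text=candidate, label=sample.label)
    return sample                       # assumption: fall back to the original


# Toy stand-ins so the sketch runs without any model or database access.
def toy_extract(text: str) -> List[Triple]:
    return [("Acme", "acquires", "Globex")] if "Acme" in text else []


def toy_retrieve(triples: List[Triple]) -> List[str]:
    return [f"Recent report: {h} {r} {t}." for h, r, t in triples]


def toy_recontextualize(sample: Sample, knowledge: List[str]) -> str:
    return f"{knowledge[0]} {sample.text}"


def toy_verify(text: str, label: str) -> bool:
    return bool(text.strip())


updated = update_sample(
    Sample("Acme shares rose sharply.", "positive"),
    toy_extract, toy_retrieve, toy_recontextualize, toy_verify,
)
print(updated.text, "|", updated.label)
```

Passing the stages in as callables keeps the sketch independent of any particular model or retrieval backend. In a real system each callable would wrap an LLM prompt or a database query.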

Paper Structure

This paper contains 28 sections, 5 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Different workflows for mitigating data contamination: (a) Data Rewriting, where LLMs modify existing data, potentially altering original labels; (b) Data Generation, where LLMs create new data from original data and task instructions, risking loss of semantic complexity; and (c) Our CoreEval Framework, where LLMs integrate external knowledge with original data for robust, semantically coherent, and label-consistent updates.
  • Figure 2: Overall flow of our CoreEval framework.
  • Figure 3: Performance (%) of the eleven involved LLMs (zero-shot) on the original and our updated datasets. We employ various prompt templates and use their average as the final result. Refer to the performance appendix for further details.
  • Figure 4: Data contamination resistance (%) of eight open-source models under different data proportions (20%, 40%, 60%, 80%, 100%). The first row shows $\delta_1$ values for the test set-only scenario across the original dataset, semantic dataset, and our updated dataset. The second row presents $\delta_2$ values for the train and test set scenario. The results are the mean values calculated across all eight open-source models.
  • Figure 5: Prompt in Real-World Knowledge Attainment
  • ...and 8 more figures