CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, Ruifeng Xu
TL;DR
This paper tackles data contamination in LLM evaluation by proposing CoreEval, an automatic contamination-resilient framework that injects real-world knowledge from GDELT into existing datasets. It combines entity extraction, knowledge retrieval, and knowledge recontextualization with a subsequent data-reflection stage that ensures label consistency and semantic coherence, thereby mitigating performance overestimation. The authors formalize a cognitive-learning-inspired workflow with three components (Real-World Knowledge Attainment, Knowledge Recontextualization, and Data Reflection) and validate the approach across five NLP tasks and multiple LLMs, reporting improved resistance to contamination as measured by the metrics $\delta_1$ and $\delta_2$. Extensive experiments demonstrate that CoreEval reduces contamination-driven gains, preserves dataset richness, and yields more stable evaluation even as contamination proportions and types vary, enabling fairer, more timely LLM benchmarking. The work contributes a general, low-human-effort method for contamination-resilient evaluation with practical implications for benchmarks, model development, and AI safety.
Abstract
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose \textbf{CoreEval}, a \textbf{Co}ntamination-\textbf{re}silient \textbf{Eval}uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, and the result is refined and restructured to ensure semantic coherence and enhanced task relevance. Finally, a robust data reflection mechanism iteratively verifies and refines labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
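
The stages named in the abstract (entity extraction, knowledge retrieval, recontextualization, and data reflection) can be pictured as a small pipeline. The following is only an illustrative sketch, not the authors' implementation: every name in it (`Sample`, `update_sample`, the `toy_*` helpers, the `NEWS` dictionary) is hypothetical, the in-memory dictionary merely stands in for a GDELT lookup, and the toy functions stand in for the LLM calls the framework would presumably use. The reflection step here simply retries a rewrite a fixed number of times and discards the sample if the label is never preserved, which is an assumption about one reasonable way to enforce label consistency.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    text: str
    label: str


def update_sample(
    sample: Sample,
    extract: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    rewrite: Callable[[Sample, List[str]], str],
    predict_label: Callable[[str], str],
    max_rounds: int = 3,
) -> Optional[Sample]:
    """Return an updated sample, or None if no rewrite keeps the original label."""
    entities = extract(sample.text)      # stage 1: entity extraction
    knowledge = retrieve(entities)       # stage 2: knowledge retrieval
    if not knowledge:
        return None                      # nothing recent to inject
    for _ in range(max_rounds):          # stage 4: reflection loop (assumed retry policy)
        candidate = rewrite(sample, knowledge)  # stage 3: recontextualization
        if predict_label(candidate) == sample.label:
            return Sample(candidate, sample.label)
    return None


# Toy stand-ins (hypothetical): a real system would query GDELT and prompt an LLM.
NEWS = {"Paris": "Paris hosted a climate summit this week."}


def toy_extract(text: str) -> List[str]:
    # Treat capitalized tokens as entities.
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def toy_retrieve(entities: List[str]) -> List[str]:
    return [NEWS[e] for e in entities if e in NEWS]


def toy_rewrite(sample: Sample, knowledge: List[str]) -> str:
    # Prepend the retrieved fact to the original text.
    return f"{knowledge[0]} {sample.text}"


def toy_predict_label(text: str) -> str:
    return "positive" if "love" in text else "negative"


updated = update_sample(
    Sample("I love Paris.", "positive"),
    toy_extract, toy_retrieve, toy_rewrite, toy_predict_label,
)
```

Passing the stages in as callables keeps the control flow visible while leaving each stage replaceable. In this toy setup a sample that mentions no entity with retrievable knowledge is skipped rather than rewritten.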
