LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction

Yucheng Li, Frank Guerin, Chenghua Lin

TL;DR

LatestEval tackles data contamination in language model evaluation by constructing a dynamic, time-sensitive reading comprehension benchmark from the most recent texts (arXiv, BBC, GitHub). It employs a three-stage pipeline to collect passages, extract key information as answers, and generate questions while removing explicit answers to promote reasoning. The approach is evaluated through a contamination-focused perplexity test, performance comparisons using an LLM-as-a-judge, and human evaluation of faithfulness, answerability, and copyability, demonstrating reduced memorisation and robust differentiation among models. The work provides a practical, publicly available framework and data to enable more reliable, up-to-date benchmarking of large language models.

Abstract

Data contamination in evaluation is becoming increasingly prevalent with the emergence of language models pre-trained on very large, automatically crawled corpora. This problem poses significant challenges for the accurate assessment of model capabilities and generalisation. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models. We develop the LatestEval automated pipeline to 1) gather the latest texts; 2) identify key information; and 3) construct questions targeting the information while removing the existing answers from the context. This encourages models to infer the answers themselves based on the remaining context, rather than simply copying them from the passage. Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks, suggesting a significantly reduced risk of data contamination and leading to a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval.
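The time-window filter and the answer-removal step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Document` class, the function names, and the 30-day window are assumptions made for illustration, and the answer sentence is supplied by hand rather than identified automatically as in the actual pipeline.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    # Hypothetical container for a collected text and its publication date.
    text: str
    published: date


def within_window(doc: Document, today: date, window_days: int = 30) -> bool:
    """Return True if the document was published in the last `window_days` days.

    The 30-day default is an assumption, not a value taken from the paper.
    """
    return today - timedelta(days=window_days) <= doc.published <= today


def remove_answer(passage: str, answer_sentence: str) -> str:
    """Delete the sentence that states the answer and normalise whitespace,
    so the answer has to be inferred from the remaining context."""
    return " ".join(passage.replace(answer_sentence, "").split())


docs = [
    Document("Old news. It was widely covered.", date(2023, 1, 5)),
    Document(
        "The model uses sparse attention. This cuts memory use. "
        "Training takes two days.",
        date(2023, 11, 20),
    ),
]

today = date(2023, 12, 1)
recent = [d for d in docs if within_window(d, today)]

# Hand-picked answer sentence; the real pipeline identifies key information itself.
answer = "This cuts memory use."
context = remove_answer(recent[0].text, answer)
print(context)
```

Only the second document falls inside the window, and the resulting context no longer contains the answer sentence, so a question about the effect of sparse attention cannot be answered by copying a span from the passage.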

Paper Structure (14 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The overall pipeline of LatestEval. Step 1 collects the latest texts; steps 2 and 3 construct the answers; step 4 constructs the corresponding queries; and step 5 prepares the passages.
  • Figure 2: Comparison of dataset perplexities across various language models, indicating the extent of contamination.
  • Figure 3: Memorisation test of the GPT-4 model on four benchmarks. Coloured text marks text generated by GPT-4 that matches the original test text. The four examples shown are the first instance of each benchmark, so they are not cherry-picked.
  • Figure 4: (a): single answer scores across five types of queries; (b): pair-wise win rate, y-axis indicates the winner.
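
As a rough illustration of the perplexity comparison referred to in Figure 2, the sketch below computes perplexity from per-token log-probabilities using the standard definition (the exponential of the negative mean log-probability). The log-probability values are invented for the example and are not results from the paper; how the paper obtains and aggregates them is not specified here.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities: a text the model may have seen during training
# tends to receive higher token probabilities than a genuinely new text.
possibly_seen = [-0.2, -0.1, -0.3, -0.2]
recent_text = [-2.1, -1.8, -2.5, -2.0]

print(perplexity(possibly_seen) < perplexity(recent_text))
```

Under this reading, an unusually low perplexity on a benchmark relative to fresh text is a hint that the benchmark may overlap with the training data, though it is not proof on its own.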