CAP: Data Contamination Detection via Consistency Amplification

Yi Zhao, Jing Li, Linyi Yang

TL;DR

Consistency Amplification-based Data Contamination Detection (CAP) is a framework that introduces the Performance Consistency Ratio (PCR) to measure dataset leakage by leveraging LM consistency. According to the authors, it is the first method to explicitly differentiate between fine-tuning and contamination.

Abstract

Large language models (LLMs) are widely used, but concerns about data contamination challenge the reliability of LLM evaluations. Existing contamination detection methods are often task-specific or require extra prerequisites, limiting practicality. We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage by leveraging LM consistency. To the best of our knowledge, this is the first method to explicitly differentiate between fine-tuning and contamination, which is crucial for detecting contamination in domain-specific models. Additionally, CAP is applicable to various benchmarks and works for both white-box and black-box models. We validate CAP's effectiveness through experiments on seven LLMs and four domain-specific benchmarks. Our findings also show that composite benchmarks built from multiple dataset sources are particularly prone to unintentional contamination. Code will be made publicly available soon.

Paper Structure

This paper contains 36 sections, 9 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: (a) Heatmap of Performance Consistency Ratios: Change direction indicates scenario: Fine-Tuning or Data Contamination. (b) Method Comparison: CAP supports diverse tasks and is effective across models at varying transparency levels; it requires no prerequisites.
  • Figure 2: (a) LM Consistency: Different points in surface space may correspond to the same points in logical and factual space, or to nearby points in semantic space. (b) The CAP Framework: (i) The input stage consists of the training and test sets, covering benchmarks like multiple-choice questions (MCQ) and sequence generation tasks; (ii) Data modification is applied with a specific consistency criterion; (iii) Both the original and modified datasets are processed through LLMs, which can be black-box models; (iv) Finally, the Performance Consistency Ratio (PCR) is calculated and compared between the training and test sets to assess the presence of fine-tuning or contamination.
  • Figure 3: Consistency-based Data Modification: (1) For MCQ benchmarks like FinEval, we apply factual consistency by rearranging the corresponding content of options. (2) For Q&A tasks like AlphaFin and FinQA, we modify logically unrelated information (e.g., 'year') while keeping the reasoning process intact. (3) For summarization tasks like ECTSum, we modify the input text while maintaining semantic meaning (cosine similarity $> 0.979$).
  • Figure 4: Illustration of Four Benchmarks: (a) Baichuan and Disc-Fin may have been contaminated by the FinEval validation set; (b) and (c) FinMA-Full and FinMA-NLP were fine-tuned with the FinQA training set. AlphaFin Research, though newly proposed, overlaps with FinQA; (d) ECTSum is less likely to have been used in training, with most values close to zero. However, the results suggest that LLaMa may have been contaminated by the training and test sets.
  • Figure 5: Samples of Factual-Consistency-Based and Semantic-Consistency-Based Data Modification
  • ...and 3 more figures
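
The consistency-based modifications described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the dictionary layout of an MCQ item, and the assumption that text embeddings are supplied as plain lists of floats are all hypothetical. Only the idea of rearranging option contents and the 0.979 cosine-similarity threshold come from the caption.

```python
import math
import random
from typing import Dict, Sequence, Tuple

# Threshold taken from the Figure 3 caption (cosine similarity > 0.979).
SIMILARITY_THRESHOLD = 0.979


def rearrange_options(
    options: Dict[str, str], answer: str, rng: random.Random
) -> Tuple[Dict[str, str], str]:
    """Shuffle option contents across the fixed labels (hypothetical helper).

    Returns the modified options and the label that now holds the
    originally correct content, so the underlying fact is unchanged.
    """
    labels = list(options)
    order = list(range(len(labels)))
    rng.shuffle(order)  # order[i] = old position whose content moves to slot i
    new_options = {labels[i]: options[labels[j]] for i, j in enumerate(order)}
    new_answer = labels[order.index(labels.index(answer))]
    return new_options, new_answer


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def is_semantically_consistent(
    original_embedding: Sequence[float],
    modified_embedding: Sequence[float],
    threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """Accept a rewritten input only if it stays close to the original.

    The embeddings are assumed to come from some external text encoder;
    which encoder the authors use is not stated in this excerpt.
    """
    return cosine_similarity(original_embedding, modified_embedding) > threshold


if __name__ == "__main__":
    item = {"A": "Equity", "B": "Bond", "C": "Option", "D": "Future"}
    shuffled, new_label = rearrange_options(item, "B", random.Random(0))
    print(shuffled, "correct label:", new_label)
```

Shuffling indices rather than matching on option text keeps the answer mapping correct even when two options happen to share the same content.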