Investigating Data Contamination for Pre-training Language Models

Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo

TL;DR

This work investigates data contamination at the pre-training stage by injecting evaluation data into pre-training corpora in two forms: text contamination (inputs) and ground-truth contamination (prompts and answers). By pre-training GPT-2 models from scratch and evaluating on SST-2, MMLU, CNN/DM, and SQuAD, it finds ground-truth leakage often yields larger performance gains than text leakage, with task-specific sensitivity. It further reveals a non-monotonic, U-shaped relationship when contamination is repeated and shows that common $n$-gram based contamination definitions can misrepresent true leakage, even at scale with GPT-2-large. The study argues for independent, rigorous contamination assessments in LLM research and highlights the limitations of evaluation-level analyses in proving robustness to contamination.

Abstract

Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus, a phenomenon known as data contamination, in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.
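
To make the two contamination forms concrete, here is a minimal sketch of how a contaminated corpus might be assembled. Everything in it is an illustrative assumption rather than the authors' pipeline: the `EvalSample` fields, the newline-joined format for ground-truth contamination (including whether the input text is part of it), and the choice to append injected samples at the end of the corpus instead of mixing them in.

```python
from dataclasses import dataclass


@dataclass
class EvalSample:
    """Hypothetical container for one evaluation example."""
    text: str    # input text of the evaluation sample
    prompt: str  # question or instruction asked on the input
    answer: str  # desired output (ground truth)


def render(sample: EvalSample, mode: str) -> str:
    """Turn an evaluation sample into a string to inject into the corpus."""
    if mode == "text":
        # Text contamination: only the input text leaks.
        return sample.text
    if mode == "ground_truth":
        # Ground-truth contamination: the prompt and answer leak as well.
        # The exact layout here is an assumption made for illustration.
        return f"{sample.text}\n{sample.prompt}\n{sample.answer}"
    raise ValueError(f"unknown contamination mode: {mode}")


def contaminate(corpus: list[str], samples: list[EvalSample],
                mode: str, factor: int) -> list[str]:
    """Return a new corpus with each sample injected `factor` times.

    factor=0 returns a copy of the original corpus.
    """
    injected = [render(s, mode) for s in samples] * factor
    return list(corpus) + injected
```

With `factor=0` the function returns the uncontaminated baseline, which corresponds to the zero-repetition setting described for Figure 1.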

Paper Structure (20 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Evaluation results for different contamination factors from 0 to 20 on each dataset. Zero repetitions refer to models pre-trained on the original corpus. In the top three figures, the solid lines and the dotted lines show the ground-truth and text contamination results respectively.
  • Figure 2: Evaluation results on removing contamination from the pre-training corpus. We deliberately select the parameters to achieve different ratios of removed tokens. The x-axis denotes the cleaning method (n-gram or Llama 2) followed by the percentage of tokens removed.
  • Figure 3: The evaluation results for AG News dataset on both contamination factor and removing contaminated data experiments. The performances for public model checkpoints from OpenAI are displayed as dotted lines in both figures.
  • Figure 4: N-gram direct overlap contamination ratio w.r.t. different n-gram values for each dataset.
  • Figure 5: Contamination ratio for pre-training data based on Llama 2's definitions. We adopt the n-gram values that keep the contamination ratio within a similar range, with thresholds from $60\%$ to $90\%$.
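
As a rough illustration of the n-gram direct overlap ratio that Figure 4 refers to, the sketch below flags an evaluation sample as contaminated if it shares at least one n-gram with the pre-training corpus. The whitespace tokenization, lowercasing, and the "any shared n-gram" rule are simplifying assumptions; the paper's exact definition may differ.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(eval_texts: list[str], corpus_texts: list[str],
                        n: int) -> float:
    """Fraction of evaluation samples sharing any n-gram with the corpus."""
    # Index every n-gram that occurs anywhere in the corpus.
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in corpus_texts:
        corpus_ngrams |= ngrams(doc.lower().split(), n)

    if not eval_texts:
        return 0.0

    # A sample counts as contaminated if any of its n-grams is in the index.
    flagged = sum(
        1 for text in eval_texts
        if ngrams(text.lower().split(), n) & corpus_ngrams
    )
    return flagged / len(eval_texts)
```

Larger values of `n` demand longer exact matches, so the measured ratio can only stay the same or fall as `n` grows. This sensitivity to `n` is one reason a single n-gram threshold may misrepresent actual leakage.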