A Taxonomy for Data Contamination in Large Language Models

Medha Palavalli, Amanda Bertsch, Matthew R. Gormley

TL;DR

This paper addresses data contamination in LLMs, where the pretraining data may contain a (possibly altered or partial) version $f(D)$ of the test set $D$, inflating evaluation scores. It models the leaked data as the result of applying a function composition $f=f^{(1)}\circ\cdots\circ f^{(n)}$ to $D$, and distinguishes dataset-level leakage $g(D)$ from instance-level modifications. Through continued pretraining experiments with GPT-2 Large on summarization and QA tasks, it shows that both in-domain exposure and approximate contamination can boost performance, sometimes more than clean data. The work highlights the need for standardized decontamination practices that address noisy, partial, and format-mismatched contamination to ensure robust, fair benchmarking of LLMs.
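To make the composition $f=f^{(1)}\circ\cdots\circ f^{(n)}$ concrete, here is a minimal Python sketch. The three transforms (`take_first_half`, `mask_label`, `lowercase_input`) and the toy dataset are hypothetical illustrations chosen for this example, not operations taken from the paper; the point is only to show how one dataset-level selection and several instance-level modifications chain into a single leaked artifact $f(D)$.

```python
from typing import Callable, Dict, List

Instance = Dict[str, str]          # hypothetical schema: {"input": ..., "label": ...}
Dataset = List[Instance]
Transform = Callable[[Dataset], Dataset]


def compose(*fs: Transform) -> Transform:
    """Return f = fs[0] ∘ fs[1] ∘ ... ∘ fs[-1]; the rightmost function runs first."""
    def composed(d: Dataset) -> Dataset:
        for f in reversed(fs):
            d = f(d)
        return d
    return composed


# Hypothetical dataset-level transform: only part of the test set leaks.
def take_first_half(d: Dataset) -> Dataset:
    return d[: len(d) // 2]


# Hypothetical instance-level transforms: each leaked example is altered.
def mask_label(d: Dataset) -> Dataset:
    return [{**x, "label": ""} for x in d]


def lowercase_input(d: Dataset) -> Dataset:
    return [{**x, "input": x["input"].lower()} for x in d]


# Toy test set D (made-up examples).
D: Dataset = [
    {"input": "Who wrote Hamlet?", "label": "Shakespeare"},
    {"input": "Capital of France?", "label": "Paris"},
    {"input": "Largest planet?", "label": "Jupiter"},
    {"input": "Boiling point of water in C?", "label": "100"},
]

f = compose(lowercase_input, mask_label, take_first_half)
leaked = f(D)  # what would end up in the pretraining corpus under this sketch
```

Under these assumptions, `leaked` holds only half of the instances, with lowercased inputs and no labels, which is the kind of partial, altered copy that an exact-match comparison against $D$ would miss.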

Abstract

Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus, inflating model performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet these contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks -- summarization and question answering -- revealing how different types of contamination influence task performance during evaluation.
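The claim that altered versions of the test set can evade decontamination can be illustrated with a small sketch. It assumes a word-level n-gram overlap filter, a common decontamination heuristic that the abstract itself does not name; the function names, the window size `n=8`, and the example sentences are all assumptions made for illustration.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_flagged(test_text: str, corpus_text: str, n: int = 8) -> bool:
    """Flag a corpus document if it shares at least one n-gram with the test text."""
    return bool(ngrams(test_text, n) & ngrams(corpus_text, n))


# Made-up test instance and two made-up corpus documents.
test_instance = "the quick brown fox jumps over the lazy dog near the river bank"
verbatim_copy = "scraped page: " + test_instance
paraphrased_copy = "a fast brown fox leaps over a lazy dog close to the river bank"

flag_verbatim = is_flagged(test_instance, verbatim_copy)        # exact leak
flag_paraphrase = is_flagged(test_instance, paraphrased_copy)   # altered leak
```

In this sketch the verbatim copy is flagged, while the paraphrase shares no 8-gram with the test instance and passes the filter even though it carries the same content.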

Paper Structure (23 sections, 8 figures, 6 tables)


Figures (8)

  • Figure 1: Taxonomy of Contamination, with some representative works in the literature that address each category.
  • Figure 2: Bar chart of all SAMSum models compared for ROUGE-L.
  • Figure 3: Bar chart of all SQuAD models compared for Exact Match.
  • Figure 4: Bar chart of all CNN/Daily Mail models compared for each metric.
  • Figure 5: Bar chart of all SAMSum models compared for each metric.
  • ...and 3 more figures