Detecting Data Contamination in LLMs via In-Context Learning

Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta

TL;DR

CoDeC introduces a model- and dataset-agnostic method for detecting training-data contamination in LLMs by measuring how in-context examples from a suspect dataset affect predictions, identifying memorization signals through negative shifts in log-probability. The approach yields a dataset-level contamination score with near-perfect separation between seen and unseen data (AUC around $99.9\%$) across diverse models, baselines, and benchmarks, and it remains robust to training dynamics and finetuning. Empirical results show contamination signals arise early in training, intensify with finetuning, and transfer to related distributions, while larger models tend to exhibit lower contamination due to improved generalization. CoDeC is efficient (two forward passes per sample, gray-box access) and provides actionable guidance for fair benchmark evaluation and model development, enabling ongoing trust in published results.

Abstract

We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
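The scoring idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `codec_score`, the `logprob` callback, the `n_context` default, and the toy scorer are all hypothetical names and choices introduced here. In practice `logprob` would wrap two forward passes of a real model (one without and one with in-context examples) and return the log-probability assigned to the target text only.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface (an assumption of this sketch): returns the
# log-probability the model assigns to `target` when preceded by `context`.
# An empty context string means "no in-context examples".
LogProbFn = Callable[[str, str], float]


def codec_score(
    samples: Sequence[str],
    logprob: LogProbFn,
    n_context: int = 1,  # illustrative default, not taken from the paper
    seed: int = 0,
) -> float:
    """Fraction of samples whose log-probability drops when other samples
    from the same dataset are prepended as context."""
    if len(samples) <= n_context:
        raise ValueError("need more samples than context examples")
    rng = random.Random(seed)
    decreased = 0
    for i, target in enumerate(samples):
        # Draw context examples from the same dataset, excluding the target.
        others = [s for j, s in enumerate(samples) if j != i]
        context = "\n\n".join(rng.sample(others, n_context)) + "\n\n"
        without_ctx = logprob("", target)      # first forward pass
        with_ctx = logprob(context, target)    # second forward pass
        if with_ctx < without_ctx:
            decreased += 1
    return decreased / len(samples)


# Toy stand-in for a model, for illustration only: "memorized" strings lose
# confidence when context is added, all other strings gain confidence.
MEMORIZED = {"alpha beta", "gamma delta", "epsilon zeta"}


def toy_logprob(context: str, target: str) -> float:
    if target in MEMORIZED:
        return -2.0 if context else -1.0
    return -3.0 if context else -5.0


seen_score = codec_score(sorted(MEMORIZED), toy_logprob)
unseen_score = codec_score(["one two", "three four", "five six"], toy_logprob)
```

With the toy scorer, the memorized set gets the maximum score and the unseen set the minimum. A real model would give intermediate values, and how to threshold them is a separate decision that this sketch does not address.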

Paper Structure

This paper contains 74 sections, 1 equation, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Change in model log-probabilities with and without context for the Pythia 1.4B model, evaluated on a sample from an unseen dataset (gsm8k) and a training dataset (ArXiv). Differences in confidence for consecutive tokens are small but consistent. See the appendix for details of this experiment.
  • Figure 2: Overview of Contamination Detection via Context (CoDeC). For each dataset element, CoDeC augments the context with a few other samples from the same dataset. A decrease in the model’s logits for the target sample indicates potential contamination. The overall contamination level is estimated as the fraction of samples exhibiting this effect.
  • Figure 3: CoDeC vs. baselines: contamination scores for training (red) and unseen (blue) datasets. Each point is a model–dataset pair. CoDeC achieves the best separation, enabling consistent classification across models and datasets.
  • Figure 4: Changes in CoDeC scores over training of OLMo 7B. A list of the datasets evaluated in this experiment can be found in the appendix.
  • Figure 5: Distribution of CoDeC scores per dataset, computed for models trained on the Pile. The last 5 datasets are parts of the training data. The complete evaluation can be found in the appendix.
  • ...and 10 more figures
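
The separation between training and unseen datasets shown in Figure 3 is summarized in the TL;DR as an AUC. As a hedged illustration of what that number measures, the sketch below computes AUC as the probability that a randomly chosen seen dataset scores higher than a randomly chosen unseen one, with ties counted as half. The function name `separation_auc` and the example scores are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def separation_auc(seen: Sequence[float], unseen: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of AUC: the fraction of
    (seen, unseen) pairs in which the seen dataset scores higher."""
    if not seen or not unseen:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for s in seen:
        for u in unseen:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(seen) * len(unseen))


# Made-up contamination scores, one per model-dataset pair.
perfect = separation_auc([0.9, 0.8], [0.1, 0.2])   # every seen pair ranks higher
partial = separation_auc([0.9, 0.3], [0.4, 0.1])   # one of four pairs is misordered
```

An AUC of 1.0 means a single threshold separates the two groups perfectly, while 0.5 means the scores carry no ranking information.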