Data Contamination Through the Lens of Time

Manley Roberts; Himanshu Thakur; Christine Herlihy; Colin White; Samuel Dooley

TL;DR

This study tackles data contamination and memorization in large language model benchmarks by exploiting the natural experiment created by known training data cutoffs. It analyzes Codeforces and Project Euler problems released over many years, comparing performance on pre-cutoff (likely seen) and post-cutoff (likely unseen) problems using regression analyses with GitHub presence and difficulty as key predictors. The authors find statistically significant contamination signals: for pre-cutoff problems, greater GitHub presence is associated with higher pass rates for GPT-4 and GPT-3.5-Turbo, while for post-cutoff problems that association disappears, supporting the contamination/memorization hypothesis. They also report robust findings on title/tag reproduction and release open-source datasets and evaluation code, arguing for dynamic, continuous benchmark release and evaluation practices in the era of web-scale training data.

Abstract

Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
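The pre-cutoff versus post-cutoff comparison described above can be illustrated with a small sketch. It is not the authors' evaluation framework: the cutoff date, the record fields (`released`, `github_presence`, `passed`), the binary pass outcome, and the toy records are all assumptions made for illustration, and the model is a single-predictor logistic regression rather than the paper's full specification (which also includes difficulty). The idea is only to show how a positive slope on GitHub presence before the cutoff, and no slope after it, would show up in a fit.

```python
import math
from datetime import date

# Hypothetical cutoff, for illustration only; substitute the documented
# training cutoff of the model being evaluated.
CUTOFF = date(2021, 9, 1)


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iters=50, damping=1e-9):
    """Fit P(y=1) = sigmoid(b0 + b1*x) with Newton's method.

    Returns (b0, b1). Assumes the data are not perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h11 = damping      # diagonal of the (negated) Hessian
        h01 = 0.0                # off-diagonal of the (negated) Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def slope_by_period(records, cutoff):
    """Fit pass ~ log(1 + GitHub presence) separately before and after cutoff."""
    periods = {
        "pre": [r for r in records if r["released"] < cutoff],
        "post": [r for r in records if r["released"] >= cutoff],
    }
    slopes = {}
    for name, subset in periods.items():
        xs = [math.log1p(r["github_presence"]) for r in subset]
        ys = [r["passed"] for r in subset]
        slopes[name] = fit_logistic(xs, ys)[1]
    return slopes


# Synthetic toy records (NOT data from the paper): before the cutoff, passing
# tracks GitHub presence; after the cutoff, it is unrelated to it.
counts = [0, 0, 1, 1, 10, 10, 100, 100]
pre_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
post_outcomes = [0, 1, 0, 1, 0, 1, 0, 1]
records = [
    {"released": date(2020, 1, 1), "github_presence": c, "passed": y}
    for c, y in zip(counts, pre_outcomes)
] + [
    {"released": date(2022, 1, 1), "github_presence": c, "passed": y}
    for c, y in zip(counts, post_outcomes)
]

slopes = slope_by_period(records, CUTOFF)
print(slopes)
```

On real data one would add difficulty as a covariate and report confidence intervals, for which a statistics library is a better fit than this hand-rolled solver.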

Paper Structure (29 sections, 2 equations, 15 figures, 16 tables)

This paper contains 29 sections, 2 equations, 15 figures, 16 tables.

Introduction
Related Work
Evaluation of Code Generation Models
Adversarial Filtering and Adaptive Benchmarks in NLP
Memorization and Contamination in LLMs
Dataset Construction
Codeforces
Project Euler
Methodological Approach
Independent Variables
Dependent Variables
Results
Pass Rate
GitHub Presence
Difficulty
...and 14 more sections

Figures (15)

Figure 1: Marginal Effects of Pass Rate Metric for GPT-4 on the Codeforces Dataset. Observe a positive association between GitHub Presence before the cutoff but not after. Also, there is a negative association between Difficulty and pass rate both before and after the cutoff.
Figure 2: Marginal Effects of Pass Rate for GPT-4 on the Codeforces Dataset
Figure 3: Marginal Effects of Pass Rate for GPT-3.5-Turbo on the Codeforces Dataset
Figure 4: Marginal Effects of Pass Rate for GPT-4 on the Codeforces Dataset (evaluated on public test cases only)
Figure 5: Marginal Effects of Pass Rate for GPT-3.5-Turbo on the Codeforces Dataset (evaluated on public test cases only)
...and 10 more figures
