Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Martin Riddell; Ansong Ni; Arman Cohan

TL;DR

This paper tackles data contamination in code-generation benchmarks by quantifying the overlap between benchmark solutions (MBPP and HumanEval) and open pretraining corpora (the Pile and the Stack) using two complementary similarity measures. It combines surface-level Levenshtein matching with semantic-level, AST-based comparison using Dolos, and aggregates the two scores by taking their maximum to identify contaminated instances. The study analyzes three open model families across the two corpora, finds non-trivial contamination (from $3.6\%$ to $20.8\%$ of solutions), and shows that models perform markedly better on contaminated examples, with significant gaps between the top- and bottom-overlap groups. Decontamination experiments reduce these performance gaps and suggest that much of the observed difference across models may stem from data leakage rather than intrinsic generalization. The work emphasizes the need for open datasets and careful evaluation in code-generation research, and it provides a reproducible pipeline and accompanying results to guide future studies.
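The two-score idea summarized above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the structural score here is a simple stand-in built on Python's `ast` module and `difflib` rather than Dolos, and the function names (`surface_similarity`, `structural_similarity`, `aggregated_similarity`) and the [0, 1] score scale are assumptions made for illustration only.

```python
import ast
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def structural_similarity(a: str, b: str) -> float:
    """Stand-in for an AST-based score (NOT Dolos): compares the sequences
    of AST node types, so renaming identifiers does not lower the score."""
    try:
        nodes_a = [type(n).__name__ for n in ast.walk(ast.parse(a))]
        nodes_b = [type(n).__name__ for n in ast.walk(ast.parse(b))]
    except SyntaxError:
        return 0.0  # assumption: unparsable code gets no structural credit
    return SequenceMatcher(None, nodes_a, nodes_b).ratio()


def aggregated_similarity(gold: str, candidate: str) -> float:
    """Aggregate the two scores with a max, as the summary describes."""
    return max(surface_similarity(gold, candidate),
               structural_similarity(gold, candidate))
```

Taking the maximum means a candidate counts as similar if either view flags it: a copy with renamed variables scores low on edit distance but high on structure, and the aggregate still catches it.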

Abstract

While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks, as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination of popular code generation benchmarks, and precisely quantify their overlap with pretraining corpora through both surface-level and semantic-level matching. In our experiments, we show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of the benchmarks where similar solutions are seen during training. We also conduct extensive analysis of the factors that affect model memorization and generalization, such as model size, problem difficulty, and question length. We release all resulting files from our matching pipeline for future research.

Paper Structure (27 sections, 1 equation, 12 figures, 13 tables)

This paper contains 27 sections, 1 equation, 12 figures, 13 tables.

Introduction
Methodology
Measuring Program Similarity
Surface-Level Similarity.
Semantic Similarity.
Quantifying Data Contamination
Aggregating Similarity Scores.
Experimental Setup
Models and Pretraining Data
Benchmarks
Results
Main Results
3.6% to 20.8% of the solutions are seen during training.
Models perform significantly better when similar solutions are seen during training.
De-contaminated results.
...and 12 more sections

Figures (12)

Figure 1: Quantifying data contamination for the Pile and the Stack corpus on two popular benchmarks, MBPP and HumanEval. "Top-1 Score" denotes the similarity score between the gold solution and the most similar program found in the training corpus.
Figure 3: Distribution of different similarity scoring methods on the MBPP dataset. Similar results for HumanEval are shown in the corresponding HumanEval figure (fig:HE_relevant_score_info).
Figure 5: Accuracy of different model series evaluated on subsets of examples with increasing overlap with the model's pretraining data. Each subset is obtained by using the $x$-axis value as a minimum threshold on the average aggregated similarity score of the top-10 matched programs in the training data.
Figure 6: Gold solution length vs. overlap with training data vs. model prediction correctness, for StarCoderBase-15.5B on MBPP. Similar results for HumanEval are shown in the corresponding HumanEval figure (fig:he_size_against_similarity).
Figure 11: We show the similarity scores for both The Pile and The Stack found by searching for answers to the gold programs in the HumanEval benchmark. We compare the similarity scores from different techniques, as well as the difference between using the top-1 score and the top-10 scores.
...and 7 more figures
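
The thresholding described in the Figure 5 caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' code: the record layout (`top10_scores`, `correct`), the function name `accuracy_above_threshold`, and the [0, 1] score scale are all hypothetical.

```python
from statistics import mean


def accuracy_above_threshold(examples, threshold):
    """Accuracy on the subset whose mean top-10 aggregated similarity score
    is at least `threshold`. Returns None if the subset is empty.

    `examples` is a list of dicts with hypothetical keys:
      - "top10_scores": aggregated similarity scores of the top matches
      - "correct": whether the model's prediction passed the tests
    """
    subset = [ex for ex in examples if mean(ex["top10_scores"]) >= threshold]
    if not subset:
        return None
    return sum(ex["correct"] for ex in subset) / len(subset)
```

Sweeping `threshold` from low to high and plotting the returned accuracy would give a curve of the kind the caption describes, with each point computed on a progressively more overlapping subset.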
