On Leakage of Code Generation Evaluation Datasets

Alexandre Matton; Tom Sherborne; Dennis Aumiller; Elena Tommasone; Milad Alizadeh; Jingyi He; Raymond Ma; Maxime Voisin; Ellen Gilsenan-McMahon; Matthias Gallé

TL;DR

This paper examines how code generation test sets contaminate modern large language models and releases Less Basic Python Problems (LBPP), a new uncontaminated benchmark of 161 prompts with their associated Python solutions.

Abstract

In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): a new, uncontaminated benchmark of 161 prompts with their associated Python solutions. LBPP is released on Hugging Face: https://huggingface.co/datasets/CohereForAI/lbpp

Paper Structure

This paper contains 13 sections, 3 figures, and 5 tables.

Introduction
Related Work
Possible sources of contamination
Direct data leakage
Data leakage through synthetic data
Overfitting to test sets
LBPP: Less Basic Python Problems
Dataset Annotation
Initial Results
Challenges in LBPP
Conclusion
Limitations
Appendix

Figures (3)

Figure 1: Histogram (excluding outliers) of occurrences for HumanEval prompts in public GitHub repositories. Every prompt occurs at least $43$ times.
Figure 2: Histogram of cosine similarities for prompts in HumanEval, MBPP and LBPP relative to two popular synthetic code training datasets. We note the high similarity of most HumanEval prompts to evol-instruct, and that LBPP has lower overall similarity to both training datasets.
Figure 3: Pass@1 rate of LBPP against (a) HumanEval and (b) MBPP. LBPP performance correlates with both prior datasets, but is designed to be genuinely unseen by contemporary LLMs.
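
Figure 3 reports Pass@1 rates. As a rough illustration of how such a rate is commonly computed, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems. This section does not state whether the authors use this estimator or a single sample per problem, so treat that choice as an assumption. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: how many passed all tests,
    k: sample budget. Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so with one sample per problem the benchmark score is simply the fraction of problems solved.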
