On Leakage of Code Generation Evaluation Datasets

Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé

TL;DR

This paper examines how code generation test sets contaminate modern large language models, and releases Less Basic Python Problems (LBPP), a new uncontaminated benchmark of 161 prompts with their associated Python solutions.

Abstract

In this paper, we consider contamination by code generation test sets, in particular through their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): a new uncontaminated benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp.

Paper Structure (13 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Histogram (excluding outliers) of occurrences for HumanEval prompts in public GitHub repositories. Every prompt occurs at least $43$ times.
  • Figure 2: Histogram of cosine similarities for prompts in HumanEval, MBPP and LBPP relative to two popular synthetic code training datasets. We note the high similarity of most HumanEval prompts to evol-instruct, and that LBPP has lower overall similarity to either training dataset.
  • Figure 3: Pass@1 rate of LBPP against (a) HumanEval and (b) MBPP. LBPP performance correlates with both prior datasets, but is designed to be genuinely unseen by contemporary LLMs.
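
The similarity comparison described for Figure 2 can be illustrated with a minimal sketch. It scores each benchmark prompt by its highest cosine similarity to any prompt in a training set, which is one plausible way to obtain the values that such a histogram bins. This excerpt does not say which embedding model produced the figure, so the sketch substitutes plain bag-of-words counts for embeddings. The function names and the toy prompts are hypothetical and are not taken from the paper or from any of the datasets named above.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; an assumed stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_similarity(prompt, training_prompts):
    """Highest similarity between one benchmark prompt and any training prompt."""
    query = bow_vector(prompt)
    return max(
        (cosine_similarity(query, bow_vector(t)) for t in training_prompts),
        default=0.0,
    )


# Hypothetical toy data, invented for illustration only.
training_prompts = [
    "Write a function that returns the sum of a list of integers.",
    "Implement binary search over a sorted array.",
]
benchmark_prompts = [
    "Write a function that returns the sum of a list of integers.",
    "Simulate a bank queue with priority customers.",
]

# One score per benchmark prompt; a histogram of these would mirror Figure 2.
scores = [max_similarity(p, training_prompts) for p in benchmark_prompts]
```

A prompt copied verbatim into the training set scores close to 1, while an unrelated prompt scores much lower. With real sentence embeddings in place of word counts, paraphrased prompts would also score high, which is the kind of indirect leakage the abstract attributes to synthetic data.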