Are Large Language Models Memorizing Bug Benchmarks?

Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, Claire Le Goues

TL;DR

This work investigates data leakage in bug benchmarks used to evaluate large language models on software engineering tasks. By combining benchmark membership checks, negative log-likelihood (NLL), and 5-gram accuracy, the study quantifies memorization across open-source base models and multiple benchmarks, with Defects4J showing the strongest leakage signals. The authors find that patch content from older benchmarks is more likely to be memorized, particularly by smaller models, whereas newer models with larger training budgets exhibit reduced leakage, though some memorization persists. The findings highlight the need for careful benchmark selection and multi-metric evaluation to assess model capabilities reliably in code-related tasks, and the authors advocate incorporating newer benchmarks such as GitBug-Java to mitigate data contamination. Overall, the paper provides a framework for assessing data leakage risks in code benchmarks and offers practical guidance for designing robust evaluations of LLMs in software engineering contexts.

Abstract

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets, such as LLaMa 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.
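
The two model-based metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Predictor` interface, the helper names, and the evenly spaced choice of starting points are assumptions made for illustration. NLL is computed here as the mean negative log-probability per token, and n-gram accuracy as the fraction of starting points at which the model reproduces the next n tokens exactly.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): given a token
# prefix and a count k, return the model's k predicted next tokens.
Predictor = Callable[[Sequence[str], int], Sequence[str]]


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def start_positions(num_tokens: int, n: int, num_starts: int) -> List[int]:
    """Evenly spaced prefix lengths that leave room for a full n-gram."""
    last = num_tokens - n  # largest prefix length with n tokens after it
    if last < 1 or num_starts < 1:
        raise ValueError("sequence too short or no starting points requested")
    if num_starts == 1:
        return [1]
    step = (last - 1) / (num_starts - 1)
    # A set removes duplicates that appear when the sequence is short.
    return sorted({1 + round(i * step) for i in range(num_starts)})


def ngram_accuracy(tokens: Sequence[str], predict: Predictor,
                   n: int = 5, num_starts: int = 5) -> float:
    """Fraction of starting points where the next n tokens match exactly."""
    starts = start_positions(len(tokens), n, num_starts)
    hits = 0
    for s in starts:
        target = list(tokens[s:s + n])
        guess = list(predict(tokens[:s], n))
        hits += guess == target
    return hits / len(starts)
```

Under these assumptions, a low mean NLL together with a high 5-gram accuracy on a benchmark file is the pattern that suggests memorization. In practice `predict` would wrap greedy decoding from a base model, and `token_logprobs` would come from scoring the file with that same model.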

Paper Structure

This paper contains 15 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Excerpt from Defects4J (Lang:Bug 39). Given the first lines of the function until line 11, codegen-multi generated lines 12 to 23, matching the benchmark solution.
  • Figure 2: Overview of our methodology for detecting leakage. We collected bug benchmarks and unseen repositories from 2024. We evaluated NLL and $N$-gram accuracy on base models, and analyzed membership of the benchmarks in TheStack.
  • Figure 3: NLL by model and dataset. NLL is not comparable across models in different families, only across benchmarks within a family. NLL for other models is consistent with the results displayed.
  • Figure 4: Heatmap illustrating the relative NLL ratios across datasets for codegen-multi, LLaMa 3.1, and Gemma 2. Each cell represents the ratio of the NLL for the dataset in the column to that of the dataset in the row. For example, the NLL for new Java repos is $5.63\times$ higher than that for Defects4J. Darker colors correspond to higher ratios.
  • Figure 5: 5-gram accuracy by model and dataset. Due to space constraints, we selected a sample of the most relevant models. 5-gram accuracy for other models is consistent with the results displayed.
  • ...and 5 more figures
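
The cell definition in the Figure 4 caption (NLL of the column dataset divided by NLL of the row dataset) can be sketched as follows. The function name `nll_ratio_matrix` and the dataset labels and values in the usage example are placeholders for illustration, not results from the paper.

```python
from typing import Dict


def nll_ratio_matrix(nll: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Pairwise ratios for one model: cell[row][col] = nll[col] / nll[row]."""
    for name, value in nll.items():
        if value <= 0:
            raise ValueError(f"NLL for {name!r} must be positive")
    return {row: {col: nll[col] / nll[row] for col in nll} for row in nll}


# Placeholder values for a single hypothetical model (not data from the paper).
example = nll_ratio_matrix({"older_benchmark": 0.5, "new_repos": 2.0})
print(example["older_benchmark"]["new_repos"])  # 4.0
```

Because each matrix uses NLL values from one model only, the ratios respect the caveat in the Figure 3 caption that NLL is not comparable across model families.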