LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code

Liwei Guo, Sixiang Ye, Zeyu Sun, Xiang Chen, Yuxia Zhang, Bo Wang, Jie M. Zhang, Zheng Li, Yong Liu

TL;DR

This work experimentally evaluates seven state-of-the-art LLMs on bug-prone Java code using the Defects4J dataset, revealing that bug-prone completion is markedly harder than normal code completion. By combining LCS and LED metrics to distinguish outputs aligned with historical bugs or fixes, the study finds a near 1:1 ratio of correct to buggy completions, substantial memorization of past bugs, and limited efficacy of post-processing techniques. The analysis identifies code constructs like method invocations and return statements as particularly error-prone and demonstrates that newer reasoning models do not yet achieve reliable bug-prone completion. The findings highlight a critical need for better training data filtering, advanced error handling, and more effective post-processing to safely integrate LLM-based code completion into real-world development environments.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance in code completion. However, the training data used to develop these models often contain a significant amount of buggy code. Yet, it remains unclear to what extent these buggy instances influence LLMs' performance when tackling bug-prone code completion tasks. To fill this gap, this paper presents the first empirical study evaluating the performance of LLMs in completing bug-prone code. Through extensive experiments on 7 LLMs and the Defects4J dataset, we analyze LLMs' accuracy, robustness, and limitations in this challenging context. Our experimental results show that completing bug-prone code is significantly more challenging for LLMs than completing normal code. Notably, in bug-prone tasks, the likelihood of LLMs generating correct code is nearly the same as generating buggy code, and it is substantially lower than in normal code completion tasks (e.g., 12.27% vs. 29.85% for GPT-4). To our surprise, 44.44% of the bugs LLMs make are completely identical to the pre-fix version, indicating that LLMs have been seriously biased by historical bugs when completing code. Additionally, we investigate the effectiveness of existing post-processing techniques and find that while they can improve consistency, they do not significantly reduce error rates in bug-prone code scenarios. Our research highlights the limitations of current LLMs in handling bug-prone code and underscores the need for improved models and post-processing strategies to enhance code completion accuracy in real-world development environments.
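
The TL;DR mentions combining LCS and LED metrics to decide whether a completion aligns with the historical bug or with its fix. A minimal sketch of how such a classification could work is given below. It assumes that LCS means longest common subsequence and LED means Levenshtein edit distance, and that both are computed over characters. The averaging rule, the `threshold` value, and the function names are hypothetical illustrations, not the authors' actual procedure.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(completion: str, reference: str) -> float:
    """Average of a normalized LCS score and a normalized LED score, in [0, 1].

    Averaging the two scores is an assumption made for this sketch.
    """
    c, r = completion.strip(), reference.strip()
    longest = max(len(c), len(r))
    if longest == 0:
        return 1.0
    lcs_score = lcs_length(c, r) / longest
    led_score = 1.0 - levenshtein(c, r) / longest
    return (lcs_score + led_score) / 2


def classify(completion: str, buggy: str, fixed: str, threshold: float = 0.9) -> str:
    """Label a completion as 'buggy', 'correct', or 'other'.

    The threshold of 0.9 is a hypothetical value chosen for illustration.
    """
    to_bug = similarity(completion, buggy)
    to_fix = similarity(completion, fixed)
    if max(to_bug, to_fix) < threshold or to_bug == to_fix:
        return "other"
    return "buggy" if to_bug > to_fix else "correct"
```

For example, `classify("return x + 1;", buggy="return x + 1;", fixed="return x + 2;")` labels the completion as matching the pre-fix line, because it is identical to the buggy version and only close to the fixed one. Completions that resemble neither version fall into the `"other"` bucket rather than being forced into one of the two classes.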

Paper Structure

This paper contains 39 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Buggy Code Example Generated By OpenAI-GPT3.5
  • Figure 2: Overview of Our Research Methodology
  • Figure 3: Example of Prompt
  • Figure 4: Analysis of Code Memorization Across Different LLMs
  • Figure 5: Distribution of similarity scores between LLM completions and ground-truth code for bug-prone vs. normal code lines (Experimental Results).
  • ...and 3 more figures