Bugs in Large Language Models Generated Code: An Empirical Study

Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, Giuliano Antoniol

TL;DR

This study systematically characterizes bugs in code produced by three leading LLMs, using the CoderEval benchmark to collect 333 buggy samples and derive a taxonomy of 10 bug patterns. It combines qualitative open coding with an analysis of how the patterns are distributed across models, and validates their relevance and prevalence through a survey of practitioners and researchers. The findings show distinct, non-human-like bug patterns (e.g., Hallucinated Object, Prompt-biased code) and indicate that Misinterpretation and Missing Corner Case bugs are common yet challenging to diagnose and fix, which informs quality assurance and repair strategies. The work provides a foundation for targeted testing, benchmarks, and interpretability approaches to improve the reliability of LLM-based code generation in real software development contexts.

Abstract

Large Language Models (LLMs) for code have gained significant attention recently. They can generate code in different programming languages based on provided prompts, fulfilling a long-lasting dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are presented in the form of a taxonomy. The identified bug patterns are validated using an online survey with 34 LLM practitioners and researchers. The surveyed participants generally asserted the significance and prevalence of the bug patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.

Paper Structure

This paper contains 30 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: High-level view of the followed methodology.
  • Figure 2: Final taxonomy of bug patterns in code generated by LLMs. The number following each category represents the percentage of code samples assigned to that category during manual labeling.
  • Figure 3: The heatmap illustrates the distribution of bug patterns on various tasks across all LLMs. The values are normalized over lines per task. The brighter areas represent a higher number of code samples of a specific task categorized into a particular bug pattern. The black areas indicate the lack of code samples in the bug pattern for a specific task.
  • Figure 4: Aggregated results of the validation survey. The questions cover how frequently each bug pattern is encountered, how difficult it is to diagnose and to fix, and how complex the bug is. We highlight in bold the highest number in each category for each bug pattern. On each scale, 1 represents never/easy/trivial/low and 5 represents always/hard/complex/high.
  • Figure 5: The posted survey on Reddit.
  • ...and 3 more figures