What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou; Haoxiang Jia; Shenxi Wu; Huiyuan Zheng; Muling Wu; Yunbo Tao; Ming Zhang; Mingxu Chai; Jessica Fan; Zhiheng Xi; Rui Zheng; Yueming Wu; Ming Wen; Tao Gui; Qi Zhang; Xipeng Qiu; Xuanjing Huang

TL;DR

The paper conducts an extensive, multi-dimensional evaluation of nine LLMs on Python code generation across standard benchmarks and a novel real-world benchmark, revealing systematic limitations as problems grow in complexity. It introduces a three-tier bug taxonomy (syntax, runtime, functional) with ten subtypes, supported by a two-stage annotation process and detailed bug analyses. A training-free self-critique method uses the bug taxonomy and compiler feedback to iteratively repair generated code, achieving a 29.2% repair rate after two iterations. The work highlights significant gaps between benchmark and real-world performance and demonstrates the potential of self-critique approaches to improve code quality without additional model training.

Abstract

The rapid development of LLMs for code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity, and number of APIs of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated than canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect code that includes three categories and ten sub-categories, and analyzed the root causes of common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark, RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Our comprehensive and extensive study provides insights into the current limitations of LLM-based code generation and opportunities for enhancing the accuracy and quality of the generated code.
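The three top-level bug categories and the iterative repair idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `classify_bug` and `repair_loop`, the `revise` callback (a stand-in for an LLM critique-and-rewrite step), and the mapping of failures to categories are all assumptions made for illustration, and the ten sub-categories are not modeled. The default of two iterations mirrors the two repair iterations mentioned in the TL;DR.

```python
from typing import Callable, Optional, Tuple

# Hypothetical type aliases for this sketch (not from the paper).
TestFn = Callable[[dict], bool]        # receives the executed namespace
ReviseFn = Callable[[str, str], str]   # (source, bug_type) -> revised source


def classify_bug(source: str, test: TestFn) -> Optional[str]:
    """Return None if the code passes, else 'syntax', 'runtime', or 'functional'.

    Assumed mapping: compile failure -> syntax, exception while running the
    code or its test -> runtime, wrong result -> functional.
    """
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError:
        return "syntax"
    namespace: dict = {}
    try:
        # Caution: exec runs untrusted code; real use would need a sandbox.
        exec(compiled, namespace)
        passed = test(namespace)
    except Exception:
        return "runtime"
    return None if passed else "functional"


def repair_loop(
    source: str,
    test: TestFn,
    revise: ReviseFn,
    max_iterations: int = 2,
) -> Tuple[str, Optional[str]]:
    """Revise the code until it passes or the iteration budget is spent."""
    for _ in range(max_iterations):
        bug_type = classify_bug(source, test)
        if bug_type is None:
            break
        # In a real system, `revise` would prompt a model with the bug type
        # and the compiler or interpreter feedback.
        source = revise(source, bug_type)
    return source, classify_bug(source, test)
```

A caller would supply a test such as `lambda ns: ns["add"](1, 2) == 3` and a `revise` function that wraps a model call; the returned status is `None` when the final code passes.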

Paper Structure (30 sections, 7 figures, 9 tables)

Introduction
Experimental Design
LLM-Based Code Generation
Effectiveness of LLMs in Code Generation.
Factors for LLMs in Code Generation
Bugs in Code Generated by LLMs
Taxonomy of Code Generation Bugs
Methodology
Type A: Syntax Bug.
Type B: Runtime Bug.
Type C: Functional Bug.
Bug Analysis
Code Generation in Real-World Projects
Benchmark Construction
Effectiveness in RWPB and Bugs Analysis
...and 15 more sections

Figures (7)

Figure 1: Differences in code characteristics between code correctly generated by models and canonical solutions on HumanEval+. CC denotes cyclomatic complexity.
Figure 2: The difference in the comment-to-code-line ratio between the correct and incorrect code generated by LLMs. SC2 denotes StarCoder-2. DC denotes DeepSeekCoder. LL3 denotes Llama-3. CL3 denotes Claude-3. DSV denotes DeepSeek-V3. DSR denotes DeepSeek-R1. A bold label indicates that the comment ratio in the incorrect code is significantly (i.e., $p < 0.05$) higher than in the correct code.
Figure 3: Taxonomy of bugs that occurred in code generated by LLMs.
Figure 4: Distribution of misunderstanding and logic error.
Figure 5: The process of constructing RWPB.
...and 2 more figures
