Fixing Function-Level Code Generation Errors for Foundation Large Language Models

Hao Wen; Yueheng Zhu; Chao Liu; Xiaoxue Ren; Weiwei Du; Meng Yan

TL;DR

Foundation LLMs still produce many function-level code generation errors. We perform a large-scale empirical study across 14 LLMs on HumanEval, compiling $12{,}837$ errors and identifying $19$ root causes, of which only three types are directly fixable: missing imports, function overflow, and inconsistent indentation. We propose LlmFix, a three-step static-analysis fixer that yields a fix rate of $17.1\%$ on LlmErrorEval, outperforming the baseline LDB by $8.9\%$, and achieving $24.9\%$ when combined with LDB. Applying LlmFix to $14$ LLMs improves function-level code generation accuracy by $7.5\%$ on average on HumanEval and MBPP, with a per-fix cost of $11.5\mathrm{ms}$. The work provides a practical benchmark, demonstrates that root-cause–driven fixes can be fast and effective, and offers guidance for future static-analysis approaches and evaluation datasets.

Abstract

Function-level code generation leverages foundation Large Language Models (LLMs) to automatically produce source code with the expected functionality. It has been widely investigated and applied in intelligent programming assistants, such as GitHub Copilot, to enhance software development productivity. Despite advances in foundation LLMs, the generated code still contains many errors. Existing studies post-process these errors with static analysis tools (e.g., TBar) or with an additional fixing LLM (i.e., LDB). However, many errors remain unresolved because their root causes have not yet been investigated, which makes it difficult to design better fixing tools. In this paper, we first conduct an empirical study of these generation errors. Specifically, we reproduce 14 representative LLMs on the HumanEval dataset and verify their correctness. We obtain 12,837 code generation errors and analyze their causes, yielding 19 categories of error causes. Our empirical analysis indicates that three of these causes can be fixed directly. Based on these findings, we propose a fixing method called LlmFix, which addresses these three types of errors through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. We evaluate LlmFix from two perspectives: its performance on error-fixing tasks and its impact on function-level code generation. For error-fixing performance, we build an evaluation dataset, LlmErrorEval. Experimental results show that LlmFix achieves a fix rate of 17.1%, outperforming the best baseline, LDB, by 8.9%. For code generation, evaluations on both the HumanEval and MBPP datasets demonstrate its effectiveness: LlmFix improves code generation accuracy by an average of 7.5% across 14 LLMs.
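The three-step process named in the abstract (indentation correction, truncation of redundant code, and import of missing modules) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `KNOWN_MODULES` allow-list, and the heuristics in each step are assumptions chosen only to make the idea concrete.

```python
import ast

# Hypothetical allow-list of standard-library modules the sketch may import.
KNOWN_MODULES = {"math", "re", "itertools", "collections", "heapq", "string"}


def normalize_indentation(code: str) -> str:
    """Step 1 (assumed heuristic): expand leading tabs to four spaces."""
    fixed = []
    for line in code.splitlines():
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)].expandtabs(4)
        fixed.append((indent + body).rstrip())
    return "\n".join(fixed)


def truncate_overflow(code: str, entry_point: str) -> str:
    """Step 2 (assumed heuristic): drop top-level text after the target function."""
    kept = []
    seen_entry = False
    for line in code.splitlines():
        is_top_level = bool(line.strip()) and not line[0].isspace()
        if seen_entry and is_top_level:
            break  # anything at column 0 after the function is treated as overflow
        if line.startswith(f"def {entry_point}("):
            seen_entry = True
        kept.append(line)
    return "\n".join(kept).rstrip() + "\n"


def add_missing_imports(code: str) -> str:
    """Step 3 (assumed heuristic): import allow-listed modules that are used but unbound."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return code  # leave code untouched if it still does not parse
    bound = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            bound.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            bound.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                bound.add(node.id)
    missing = sorted((used & KNOWN_MODULES) - bound)
    return "".join(f"import {name}\n" for name in missing) + code


def llm_fix_sketch(code: str, entry_point: str) -> str:
    """Apply the three steps in the order the abstract lists them."""
    code = normalize_indentation(code)
    code = truncate_overflow(code, entry_point)
    return add_missing_imports(code)


# Hypothetical model output: tab indentation, a missing import, and trailing text.
raw_output = (
    "def area(r):\n"
    "\treturn math.pi * r ** 2\n"
    "\n"
    "print(area(1))\n"
    "Here is an explanation of the code.\n"
)
fixed_output = llm_fix_sketch(raw_output, "area")
```

Each step here is purely static and needs no model call, which is consistent with the low per-fix cost reported in the TL;DR. A real fixer would need more careful handling of cases this sketch ignores, such as column-0 comments inside a function body or helper functions generated after the entry point.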

Paper Structure (39 sections, 7 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Function-Level Code Generation
Error Fixing for Foundation LLMs
Empirical Study
Research Questions
Dataset and Evaluation Measure
Can We Reproduce the Performance of Existing Foundation LLMs? (RQ1)
Motivation and Method
LLMs Reproduction and Evaluation
Why did These LLMs Fail in Code Generation? (RQ2)
Motivation and Method
Error Classification and Causes Analysis
Distribution of Error Types
The relationship between failed samples and errors
...and 24 more sections

Figures (7)

Figure 1: Distribution of test results for the 14 models, sorted by mean value from lowest to highest. Thin vertical lines mark the means, thick short lines inside the boxes mark the medians, and stars indicate the performance reported in the related papers [zheng2023survey].
Figure 2: Structure of the output file produced after testing each sample in the dataset, showing one sample that passed the test and one that failed.
Figure 3: Distribution of total errors across 14 LLMs in ten test runs.
Figure 4: Taxonomy of errors introduced when generating code with LLMs.
Figure 5: Flowchart of the method LlmFix.
...and 2 more figures
