Fixing Function-Level Code Generation Errors for Foundation Large Language Models
Hao Wen, Yueheng Zhu, Chao Liu, Xiaoxue Ren, Weiwei Du, Meng Yan
TL;DR
Foundation LLMs still produce many function-level code generation errors. We perform a large-scale empirical study across 14 LLMs on HumanEval, compiling $12{,}837$ errors and identifying $19$ root causes, of which only three types are directly fixable: missing imports, function overflow, and inconsistent indentation. We propose LlmFix, a three-step static-analysis fixer that yields a fix rate of $17.1\%$ on LlmErrorEval, outperforming the baseline LDB by $8.9\%$, and achieving $24.9\%$ when combined with LDB. Applying LlmFix to $14$ LLMs improves function-level code generation accuracy by $7.5\%$ on average on HumanEval and MBPP, with a per-fix cost of $11.5\mathrm{ms}$. The work provides a practical benchmark, demonstrates that root-cause–driven fixes can be fast and effective, and offers guidance for future static-analysis approaches and evaluation datasets.
Abstract
Function-level code generation leverages foundation Large Language Models (LLMs) to automatically produce source code with the expected functionality. It has been widely investigated and applied in intelligent programming assistants, such as GitHub Copilot, to enhance software development productivity. Despite advances in foundation LLMs, the generated code still contains many errors. Existing studies post-process these errors with static analysis tools (e.g., TBar) or with an additional fixing LLM (e.g., LDB). However, many errors remain unresolved because their root causes have not yet been investigated, which makes it difficult to design better fixing tools. In this paper, we first conducted an empirical study of these generation errors. Specifically, we reproduced the results of 14 representative LLMs on the HumanEval dataset and verified the correctness of the generated code. We obtained 12,837 code generation errors and analyzed their causes, which yielded 19 categories of error causes. Our empirical analysis indicated that three of these causes can be fixed directly. Based on these findings, we proposed a fixing method called LlmFix, which addresses these three types of errors through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. We evaluated LlmFix from two perspectives: its performance on error-fixing tasks and its impact on function-level code generation tasks. To measure error-fixing performance, we built an evaluation dataset, LlmErrorEval. Experimental results show that LlmFix achieves a fix rate of 17.1%, outperforming the best baseline, LDB, by 8.9%. For code generation, evaluations on both the HumanEval and MBPP datasets demonstrate the effectiveness of LlmFix, which improves code generation accuracy by an average of 7.5% across 14 LLMs.
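To make the three-step process concrete, the following is a minimal Python sketch of what such a static post-processing pipeline could look like. It is not the authors' LlmFix implementation: the function names, the `KNOWN_MODULES` allow-list, and the truncation heuristic (cut at the first non-indented line after the target function) are all assumptions introduced for illustration, and the heuristic ignores cases such as unindented text inside a multi-line docstring.

```python
import ast
import textwrap

# Hypothetical allow-list of modules that may be auto-imported (an assumption,
# not taken from the paper).
KNOWN_MODULES = {"math", "re", "collections", "itertools", "heapq", "string"}


def normalize_indentation(code: str) -> str:
    """Step 1 (sketch): expand tabs and strip the common leading indent."""
    return textwrap.dedent(code.expandtabs(4))


def truncate_after_function(code: str, entry_point: str) -> str:
    """Step 2 (sketch): drop everything after the body of `entry_point`.

    Lines before the target function (imports, helpers) are kept.
    """
    lines = code.splitlines()
    start = next(
        (i for i, line in enumerate(lines)
         if line.startswith(f"def {entry_point}(")),
        None,
    )
    if start is None:
        return code  # target function not found; leave the code unchanged
    end = len(lines)
    for i in range(start + 1, len(lines)):
        line = lines[i]
        # The first non-blank, non-indented line marks the end of the body.
        if line.strip() and not line[0].isspace():
            end = i
            break
    return "\n".join(lines[:end]).rstrip() + "\n"


def add_missing_imports(code: str) -> str:
    """Step 3 (sketch): prepend imports for known modules used but not bound."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return code  # cannot analyze code that does not parse
    bound = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            bound.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            bound.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Name):
            # Names that are assigned locally are treated as already bound.
            (used if isinstance(node.ctx, ast.Load) else bound).add(node.id)
    missing = sorted((used & KNOWN_MODULES) - bound)
    return "".join(f"import {m}\n" for m in missing) + code


def fix_generated_function(code: str, entry_point: str) -> str:
    """Apply the three steps in the order listed in the abstract."""
    code = normalize_indentation(code)
    code = truncate_after_function(code, entry_point)
    return add_missing_imports(code)


if __name__ == "__main__":
    # Tab-indented output with a missing import and trailing non-code text.
    raw = (
        "\tdef area(r):\n"
        "\t\treturn math.pi * r ** 2\n"
        "\n"
        "\tprint(area(1))\n"
        "\tHere is an explanation of the code above.\n"
    )
    print(fix_generated_function(raw, "area"))
```

In this sketch each step is a pure string-to-string function, so the steps can be applied or skipped independently. A real fixer would need a more robust way to find the end of the function and a broader policy for deciding which imports are safe to add.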
