Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models
Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang
TL;DR
This work addresses the limited understanding of code generation errors made by large language models (LLMs) through an empirical analysis of six representative LLMs on the HumanEval benchmark. Using open coding and thematic analysis, the authors derive a fine-grained taxonomy of semantic and syntactic error characteristics, quantify repair effort with multiple similarity metrics, and examine how task complexity and test pass rates relate to error types. Key findings include the predominance of non-trivial, multi-line errors whose root causes vary by model, the substantial effort needed to repair them, and the potential to improve repairs by signaling error characteristics to repair agents. The study offers practical guidance for fault localization, task-difficulty estimation, and taxonomy-guided debugging, and provides an interactive website and code-label datasets to support future research.
Abstract
Large Language Models (LLMs) have demonstrated unprecedented capabilities in code generation. However, the errors that LLMs produce when generating code remain poorly understood. To bridge this gap, we conducted an in-depth analysis of code generation errors across six representative LLMs on the HumanEval dataset. Specifically, we first employed open coding and thematic analysis to distill a comprehensive taxonomy of code generation errors. We analyzed two dimensions of error characteristics: semantic and syntactic. Our analysis revealed that LLMs often made non-trivial, multi-line code generation errors in various locations and with various root causes. We further analyzed the correlation between these errors and both task complexity and test pass rate. Our findings highlighted several challenges in locating and fixing code generation errors made by LLMs. Finally, we discussed several future directions to address these challenges.
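To make the notion of a "multi-line" error concrete, one option is to diff a generated solution against a corrected version and count the lines that differ. The following is a minimal sketch using Python's standard `difflib`, not the metrics used in this study; the function `repair_effort`, its output fields, and the two `running_max` snippets are hypothetical and exist only for illustration.

```python
import difflib


def repair_effort(generated: str, fixed: str) -> dict:
    """Estimate how much of a generated solution must change to reach a fix.

    Hypothetical helper: compares the two programs line by line and reports
    the number of changed lines plus a line-level similarity ratio.
    """
    gen_lines = generated.splitlines()
    fix_lines = fixed.splitlines()
    matcher = difflib.SequenceMatcher(a=gen_lines, b=fix_lines, autojunk=False)

    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the larger side of each edit (replace, insert, or delete).
            changed += max(i2 - i1, j2 - j1)

    return {
        "changed_lines": changed,
        "line_similarity": matcher.ratio(),  # 1.0 means identical line lists
        "multi_line": changed > 1,
    }


# Illustrative (made-up) example: the generated code assumes non-negative input.
generated_code = """def running_max(xs):
    out = []
    best = 0
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
"""

fixed_code = """def running_max(xs):
    out = []
    best = None
    for x in xs:
        best = x if best is None else max(best, x)
        out.append(best)
    return out
"""

result = repair_effort(generated_code, fixed_code)
print(result)
```

In this example, two non-adjacent lines differ, so the error is flagged as multi-line. Line-level diffing is a deliberately coarse choice; token-level or AST-level comparison would give a finer-grained estimate.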
