Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models

Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang

TL;DR

This work addresses the limited understanding of code generation errors made by large language models (LLMs) through an empirical analysis of six representative LLMs on the HumanEval benchmark. Using open coding and thematic analysis, the authors derive a fine-grained taxonomy of semantic and syntactic error characteristics, quantify repair effort with multiple similarity metrics, and examine how task complexity and test pass rate relate to error types. Key findings are that errors are predominantly non-trivial and multi-line, that root causes vary by model, that fixing them requires substantial repair effort, and that signaling error characteristics to repair agents may improve repairs. The study offers practical guidance for fault localization, task-difficulty estimation, and taxonomy-guided debugging, and provides an interactive website and code-label datasets to support future research.

Abstract

Large Language Models (LLMs) have demonstrated unprecedented capabilities in code generation. However, there remains a limited understanding of code generation errors that LLMs can produce. To bridge the gap, we conducted an in-depth analysis of code generation errors across six representative LLMs on the HumanEval dataset. Specifically, we first employed open coding and thematic analysis to distill a comprehensive taxonomy of code generation errors. We analyzed two dimensions of error characteristics -- semantic characteristics and syntactic characteristics. Our analysis revealed that LLMs often made non-trivial, multi-line code generation errors in various locations and with various root causes. We further analyzed the correlation between these errors and task complexity as well as test pass rate. Our findings highlighted several challenges in locating and fixing code generation errors made by LLMs. In the end, we discussed several future directions to address these challenges.

Paper Structure (20 sections, 9 figures, 3 tables)

This paper contains 20 sections, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Distribution of semantic characteristics of code generation errors made by six LLMs.
  • Figure 2: Distribution of syntactic characteristics of code generation errors made by six LLMs.
  • Figure 3: Mappings between semantic and syntactic error characteristics of code generation errors made by six LLMs.
  • Figure 4: Levenshtein distance between the incorrect code and correct code. The vertical dashed lines indicate the medians.
  • Figure 5: CodeBERTScore between the incorrect code and correct code. The vertical dashed lines indicate the medians.
  • ...and 4 more figures
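
Figure 4 reports the Levenshtein distance between incorrect and correct code as one measure of repair effort. As a rough illustration of that metric, the sketch below computes a character-level edit distance with the standard dynamic-programming recurrence. The granularity (characters rather than tokens), the function name `levenshtein`, and the example code pair are assumptions made for illustration and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical incorrect/correct pair, not drawn from the paper's dataset.
incorrect = "def add(a, b):\n    return a - b\n"
correct = "def add(a, b):\n    return a + b\n"
distance = levenshtein(incorrect, correct)
```

A small distance such as the single-character fix in this pair corresponds to a trivial repair, whereas the multi-line errors the study describes would yield much larger values.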