Understanding Defects in Generated Codes by Language Models

Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila

TL;DR

The paper tackles the problem of unreliable code generation by code language models (CLMs): it classifies 367 defects found in code snippets generated for HumanEval and evaluates whether prompt engineering can mitigate them. It applies an integrated defect taxonomy combining Orthogonal Defect Classification (ODC) and IEEE standards, measures model performance with exact match (EM), pass@k, and CodeBLEU, and compares five prompting strategies, of which Structured CoT delivers the strongest gains. The key contributions are a comprehensive defect taxonomy, empirical evidence that prompt engineering improves code correctness, and guidance for safer deployment of CLMs in software tasks. The findings have practical implications for improving CLM reliability in real-world coding tasks and motivate further research into adaptive prompting and broader model comparisons.
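Of the metrics named above, pass@k is the one most directly tied to functional correctness. The summary does not say how the paper computes it, so the following is only a minimal sketch assuming the commonly used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical, chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed unbiased estimator).

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: number of samples the user is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample
    # is C(n - c, k) / C(n, k); pass@k is the complement of that.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass, which is a quick sanity check on any implementation.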

Abstract

This study investigates the reliability of code generation by Large Language Models (LLMs), focusing on identifying and analyzing defects in the generated code. Despite the advanced capabilities of LLMs in automating code generation, ensuring the accuracy and functionality of the output remains a significant challenge. Using a structured defect classification method to understand the nature and origins of these defects, this study categorizes and analyzes 367 defects identified in code snippets generated by LLMs, a significant proportion of which are functionality and algorithm errors. These error categories indicate key areas where LLMs frequently fail, underscoring the need for targeted improvements. To enhance the accuracy of code generation, this paper implements five prompt engineering techniques: Scratchpad Prompting, Program of Thoughts Prompting, Chain-of-Thought Prompting, Chain of Code Prompting, and Structured Chain-of-Thought Prompting. These techniques were applied to refine the input prompts, aiming to reduce ambiguities and improve the models' accuracy rate. The findings suggest that precise and structured prompting significantly mitigates common defects, thereby increasing the reliability of LLM-generated code.

Paper Structure (15 sections, 3 equations, 2 figures, 4 tables)