Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues

Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, David Lo

TL;DR

This study provides the first comprehensive, time-sensitive evaluation of ChatGPT-generated code across Java and Python on 2,033 LeetCode tasks, quantifying correctness and a broad spectrum of code-quality issues. It combines automated static analysis, qualitative taxonomy via open card sorting, and prompt-based repair experiments to reveal how task difficulty, code size, and language shape performance, and how feedback-driven prompting can partially repair defects. The authors demonstrate that while ChatGPT can produce functionally correct code in a majority of cases, maintainability and correctness issues remain prevalent, and self-repair via prompts is promising but not yet comprehensive. Importantly, the work releases a large replication dataset to foster ongoing research and tooling improvements for AI-assisted code generation.

Abstract

We systematically study the quality of 4,066 ChatGPT-generated programs, implemented in two popular programming languages (Java and Python), for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments show that, of the 4,066 programs generated by ChatGPT, 2,756 are deemed correct, 1,082 produce wrong outputs, and 177 contain compilation or runtime errors. We further analyze other characteristics of the generated code, such as code style and maintainability, using static analysis tools, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. We then investigate ChatGPT's self-repair ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but limitations and opportunities for improvement remain. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.
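
The counts quoted in the abstract can be turned into percentages with a few lines of arithmetic. The sketch below is only an illustration of that calculation, not part of the authors' tooling; the names `TOTAL`, `counts`, `share`, and `unlisted` are made up for this example. The three outcome categories listed in the abstract do not add up to 4,066, so the script reports the remainder without guessing how those programs were classified.

```python
# Counts as quoted in the abstract; variable names are illustrative only.
TOTAL = 4066
counts = {
    "correct": 2756,
    "wrong_output": 1082,
    "compile_or_runtime_error": 177,
}
MAINTAINABILITY_ISSUES = 1930  # snippets flagged by static analysis


def share(count, total=TOTAL):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of all generated programs falling into each listed category.
shares = {name: share(n) for name, n in counts.items()}

# Programs not covered by the three categories named in the abstract.
unlisted = TOTAL - sum(counts.values())

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
    print(f"maintainability issues: {share(MAINTAINABILITY_ISSUES)}%")
    print(f"not listed in the abstract's breakdown: {unlisted} programs")
```

Under these figures, roughly two thirds of the generated programs are correct, and a little under half are flagged for maintainability issues.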

Paper Structure (33 sections, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Example of buggy code generated by ChatGPT for LeetCode Problem 1093, 'Statistics from a Large Sample'
  • Figure 2: Overview of our workflow
  • Figure 3: Task distribution across time
  • Figure 4: Task distribution across difficulty
  • Figure 5: Pass Rate by Difficulty and Time Period
  • ...and 7 more figures