No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT

Zhijie Liu; Yutian Tang; Xiapu Luo; Yuming Zhou; Liang Feng Zhang

TL;DR

This work presents a comprehensive, multi-language evaluation of ChatGPT-based code generation across 728 LeetCode problems and 18 CWE scenarios, examining correctness, code complexity, and security, and it investigates the efficacy of a multi-round fixing workflow. It shows that ChatGPT performs notably better on problems published before 2021, while direct fixes via dialog are limited, and it documents how code complexity tends to rise or remain stable through iterative fixes. Security analysis reveals prevalent vulnerabilities in generated code, but a CWE-guided multi-round fixing approach dramatically improves vulnerability remediation, suggesting that coupling LLMs with static analysis can enhance practical code safety. The study offers actionable insights into the capabilities and limitations of AI-assisted code generation and provides an online artifact to support reproducibility and future improvements in AI-assisted software engineering.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various NLP tasks. LLMs are also highly valuable in supporting software engineering tasks, particularly code generation. Automatic code generation is the process of producing source code or executable code from given specifications or requirements, which improves developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generation using ChatGPT. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., ChatGPT's dialog ability) to facilitate code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into the performance of ChatGPT in tackling code generation tasks with respect to these three aspects. Overall, our findings uncover potential issues and limitations that arise in ChatGPT-based code generation and lay the groundwork for improving AI- and LLM-based code generation techniques.
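The multi-round fixing process mentioned above can be pictured as a simple feedback loop: generate code, evaluate it, and return the evaluator's message to the model as the next dialog turn. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `multi_round_fix`, `generate`, and `judge`, the prompt wording, and the default of five rounds are all hypothetical; `generate` stands in for a chat model call and `judge` for an online judge or a static analyzer.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generate(prompt) -> candidate code as a string (stand-in for a chat model)
#   judge(code)      -> None if the code is accepted, else a feedback message
Generate = Callable[[str], str]
Judge = Callable[[str], Optional[str]]


def multi_round_fix(
    task: str,
    generate: Generate,
    judge: Judge,
    max_rounds: int = 5,  # assumed round budget, chosen for illustration
) -> Tuple[Optional[str], int]:
    """Return (accepted_code, rounds_used), or (None, max_rounds) on failure."""
    prompt = task
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt)
        feedback = judge(code)
        if feedback is None:
            return code, round_no
        # Feed the evaluator's message back as the next dialog turn.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            f"Feedback: {feedback}\nPlease fix the code."
        )
    return None, max_rounds


if __name__ == "__main__":
    # Toy stand-ins: the "model" fails once, then returns a passing answer.
    attempts = iter(["def add(a, b): return a - b",
                     "def add(a, b): return a + b"])

    def toy_judge(code: str) -> Optional[str]:
        namespace: dict = {}
        exec(code, namespace)
        return None if namespace["add"](2, 3) == 5 else "Wrong Answer on add(2, 3)"

    print(multi_round_fix("Write add(a, b).", lambda _prompt: next(attempts), toy_judge))
```

The same loop shape covers both settings the abstract describes: for correctness, `judge` would wrap a test harness; for security, it would wrap a vulnerability checker that reports the detected CWE.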

Paper Structure (22 sections, 1 equation, 32 figures, 32 tables)

Introduction
Background
Empirical Study Setup
Data Collection
Methodology
Experiment Environment
Experiment and Evaluation
Functionally Correct Code Generation
Multi-round Fixing for Code Generation
Code with Wrong Answer
Code with Compile Error
Code with Runtime Error
Code with Time Limit Exceeded
Code Complexity
Security Code Generation
...and 7 more sections

Figures (32)

Figure 1: ChatGPT-generated Bubble Sort Algorithm in Python.
Figure 2: The workflow of interacting with ChatGPT to generate code snippets.
Figure 3: An example of prompt for two sum problem in Python3.
Figure 4: Function in C code generated by ChatGPT is not declared before invocation.
Figure 5: Distribution of the ratios of languages accepted to corresponding Aft. problems, where the dark violet and dark red lines represent the median and mean, respectively.
...and 27 more figures
