Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

Qi Guo; Junming Cao; Xiaofei Xie; Shangqing Liu; Xiaohong Li; Bihuan Chen; Xin Peng

TL;DR

The paper investigates the potential of ChatGPT for automated code refinement during code review by conducting a rigorous empirical study using the CodeReview benchmark and a newly created CodeReview-New dataset. It compares ChatGPT (GPT-3.5-Turbo in zero-shot mode) with CodeReviewer (a state-of-the-art refinement tool) and analyzes the impact of prompts and temperature on performance, revealing that ChatGPT can generalize better to unseen reviews but still lags behind ideal performance in certain tasks. The study identifies root causes, such as domain-knowledge gaps and unclear review locations, and offers mitigation strategies including harnessing GPT-4 and improving review quality. Overall, the work demonstrates promising potential for ChatGPT to assist automated code refinement and provides concrete directions for improving evaluation, data quality, and model capabilities.

Abstract

Code review is an essential activity for ensuring the quality and maintainability of software projects. However, it is a time-consuming and often error-prone task that can significantly impact the development process. Recently, ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks, suggesting its potential to automate code review processes. However, it is still unclear how well ChatGPT performs in code review tasks. To fill this gap, we conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks, focusing specifically on automated code refinement based on given code reviews. For the study, we select the existing benchmark CodeReview and construct a new, high-quality code review dataset. We use CodeReviewer, a state-of-the-art code review tool, as a baseline for comparison with ChatGPT. Our results show that ChatGPT outperforms CodeReviewer in code refinement tasks. Specifically, on the high-quality code review dataset, ChatGPT achieves EM and BLEU scores of 22.78 and 76.44, respectively, while the state-of-the-art method achieves only 15.50 and 62.88. We further identify the root causes of the cases where ChatGPT underperforms and propose several strategies to mitigate these challenges. Our study provides insights into the potential of ChatGPT in automating the code review process and highlights potential research directions.
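To make the EM (exact match) figures above more concrete, here is a minimal sketch of how such a score could be computed over a set of refined code snippets. The function names and the whitespace normalization are assumptions made for illustration; the paper's own evaluation scripts may tokenize or trim the code differently.

```python
from typing import Sequence


def normalize(code: str) -> str:
    """Collapse every run of whitespace to a single space (assumed normalization)."""
    return " ".join(code.split())


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions identical to their reference
    after normalization. Hypothetical helper, not the paper's implementation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)
```

Under this sketch, a prediction counts as a hit only if it matches the ground-truth revision exactly once formatting differences are ignored, which is why EM scores are typically much lower than BLEU scores on the same outputs.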
