Studying Quality Improvements Recommended via Manual and Automated Code Review

Giuseppe Crupi, Rosalia Tufano, Gabriele Bavota

TL;DR

This study compares code-review comments generated by ChatGPT-4 Turbo with those written by human reviewers on real GitHub PRs for Java and Python projects. The authors manually label 739 human review comments, run ChatGPT on the same PRs, and quantify the overlap. ChatGPT produces far more comments than humans, yet it covers only about 10% of the quality issues that human reviewers raise, while roughly 40% of its additional comments point to meaningful issues. The results indicate substantial complementarity: AI can surface additional quality signals but cannot replace human inspection, given its limited recall and the risk of non-actionable or hallucinated feedback. The practical takeaway is to use AI as a supplementary co-reviewer under human oversight; the work also calls for broader validation with different models, prompts, and languages.

Abstract

Several Deep Learning (DL)-based techniques have been proposed to automate code review. Still, it is unclear to what extent these approaches can recommend quality improvements as a human reviewer would. We study the similarities and differences between code reviews performed by humans and those automatically generated by DL models, using ChatGPT-4 as a representative of the latter. In particular, we run a mining-based study in which we collect and manually inspect 739 comments posted by human reviewers to suggest code changes in 240 PRs. The manual inspection aims at classifying the type of quality improvement recommended by human reviewers (e.g., rename variable/constant). Then, we ask ChatGPT to perform a code review on the same PRs and we compare the quality improvements it recommends against those suggested by the human reviewers. We show that while, on average, ChatGPT tends to recommend a higher number of code changes than human reviewers (~2.4x more), it can only spot 10% of the quality issues reported by humans. However, ~40% of the additional comments generated by the LLM point to meaningful quality issues. In short, our findings show the complementarity of manual and AI-based code review. This finding suggests that, in its current state, DL-based code review can be used as a further quality check on top of the one performed by humans, but should not be considered a valid alternative to them nor a means to save code review time, since human reviewers would still need to perform their manual inspection while also validating the quality issues reported by the DL-based technique.
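
The two headline figures in the abstract (how many more comments the model produces, and what share of human-reported issues it also raises) can be illustrated with a minimal sketch. It assumes that each comment has already been reduced to a PR identifier plus a quality-improvement label, and that two comments match when both are identical. That is a simplification: the study relies on manual inspection, not exact label matching. The `Comment` class, the `review_overlap` function, the PR identifiers, and every label other than "rename variable/constant" are hypothetical, and the data is a made-up toy sample rather than data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    pr_id: str  # hypothetical PR identifier
    label: str  # quality-improvement type assigned during labeling


def review_overlap(human, ai):
    """Return (AI-to-human comment ratio, share of human issues also raised by AI).

    Assumption: a human comment counts as matched when the AI produced a
    comment with the same PR identifier and the same label.
    """
    if not human:
        raise ValueError("at least one human comment is required")
    ai_keys = {(c.pr_id, c.label) for c in ai}
    matched = sum(1 for c in human if (c.pr_id, c.label) in ai_keys)
    return len(ai) / len(human), matched / len(human)


# Toy, made-up data for illustration only (not taken from the study).
human = [
    Comment("pr-1", "rename variable/constant"),
    Comment("pr-1", "add missing test"),
    Comment("pr-2", "simplify conditional"),
    Comment("pr-2", "improve error handling"),
]
ai = [
    Comment("pr-1", "rename variable/constant"),
    Comment("pr-1", "add documentation"),
    Comment("pr-1", "remove dead code"),
    Comment("pr-2", "extract method"),
    Comment("pr-2", "add documentation"),
    Comment("pr-2", "replace magic number"),
]

ratio, recall = review_overlap(human, ai)
print(f"AI/human comment ratio: {ratio:.2f}")
print(f"Share of human issues also raised by AI: {recall:.0%}")
```

The sketch pools comments across PRs, whereas the abstract reports an average, so the aggregation used in the study may differ. The share of AI-only comments that are meaningful is left out because it depends on a human judgment that a label comparison cannot reproduce.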

Paper Structure

This paper contains 13 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Taxonomy of quality improvements recommended by humans. Bars below each type of quality improvement indicate the percentage of cases in which ChatGPT recommends exactly the same code change (yellow) or at least a related one (blue).