Code Review Automation: Strengths and Weaknesses of the State of the Art

Rosalia Tufano; Ozren Dabić; Antonio Mastropaolo; Matteo Ciniselli; Gabriele Bavota

TL;DR

The paper critically evaluates three state-of-the-art code review automation techniques on two tasks: code-to-comment (writing a review comment for a code change) and code & comment-to-code (revising code to address a reviewer's comment). It finds that deep learning and IR-based approaches can automate simple code review changes but struggle with complex, cross-component changes because they see only limited code context. Through a qualitative analysis of 2,291 predictions, the authors build two taxonomies of code change types, uncover dataset quality issues with roughly 25% noise, and compare the state-of-the-art methods against a general-purpose LLM (ChatGPT). ChatGPT can match or exceed the specialized techniques in certain code & comment-to-code scenarios but lags in the code-to-comment task, which underscores the need for specialized, context-rich models and cleaner datasets. The authors release replication data to support further research and benchmarking.

Abstract

The automation of code review has been tackled by several researchers with the goal of reducing its cost. The adoption of deep learning in software engineering pushed the automation to new boundaries, with techniques imitating developers in generative tasks, such as commenting on a code change as a reviewer would do or addressing a reviewer's comment by modifying code. The performance of these techniques is usually assessed through quantitative metrics, e.g., the percentage of instances in the test set for which correct predictions are generated, leaving many open questions about the techniques' capabilities. For example, knowing that an approach is able to correctly address a reviewer's comment in 10% of cases is of little value without knowing what was asked by the reviewer: What if in all successful cases the code change required to address the comment was just the removal of an empty line? In this paper we aim at characterizing the cases in which three code review automation techniques tend to succeed or fail in the two above-described tasks. The study has a strong qualitative focus, with ~105 man-hours invested in manually inspecting correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis is a pair of taxonomies reporting, for each of the two tasks, the types of code changes on which the experimented techniques tend to succeed or to fail, pointing to areas for future work. Our manual analysis also identified several issues in the datasets used to train and test the experimented techniques. Finally, we assess the importance of research on techniques specialized for code review automation by comparing their performance with ChatGPT, a general-purpose large language model, finding that ChatGPT struggles in commenting on code as a human reviewer would do.
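
The abstract's point about aggregate metrics can be illustrated with a minimal sketch. The code below is not the authors' evaluation script. It assumes that a "correct prediction" means an exact match after whitespace normalization, and the function names and the `change_types` labels are hypothetical. It shows how one overall percentage can hide the fact that all successes come from a single, trivial type of change.

```python
from collections import defaultdict


def _normalize(text):
    # Assumption: predictions are compared after collapsing whitespace.
    return " ".join(text.split())


def exact_match_rate(predictions, references):
    """Fraction of predictions identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(_normalize(p) == _normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


def exact_match_by_type(predictions, references, change_types):
    """Exact-match rate computed separately for each change-type label."""
    counts = defaultdict(lambda: [0, 0])  # label -> [hits, total]
    for p, r, label in zip(predictions, references, change_types):
        counts[label][0] += _normalize(p) == _normalize(r)
        counts[label][1] += 1
    return {label: hits / total for label, (hits, total) in counts.items()}


# Toy data: both successes are formatting-only changes.
predictions = ["x = 1", "return a+b", "foo()"]
references = ["x  =  1", "return a + b", "foo()"]
change_types = ["formatting", "logic", "formatting"]

overall = exact_match_rate(predictions, references)
per_type = exact_match_by_type(predictions, references, change_types)
```

On this toy data the overall rate is two out of three, yet the per-type breakdown shows that every success is a formatting change and the only logic change fails. The taxonomies described above provide this kind of breakdown for real predictions.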

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, 6 tables.

Introduction
Related Work
Study Design
Study Context
Techniques for Code Review Automation
Datasets and Predictions
Data Collection and Analysis
RQ$_1$: Correct vs wrong recommendations
RQ$_2$: Datasets quality
RQ$_3$: Comparison with LLMs
Results Discussion
RQ$_1$: Correct vs wrong recommendations
Code-to-comment
Code & comment-to-code
RQ$_2$: Datasets quality
...and 3 more sections

Figures (3)

Figure 1: Taxonomy of types of changes for the code-to-comment task. The color assigned to each label reflects the ability of the techniques to automate the code review task in the context of such a change type (white best, black worst). We report the percentage of successful predictions by each approach for each change type as bars below the corresponding category: T5cr (blue bar), CodeReviewer (green), and CommentFinder (red).
Figure 2: Taxonomy of types of changes for the code & comment-to-code task. The color assigned to each label reflects the ability of the techniques to automate the code review task in the context of such a change type (white best, black worst). We report the percentage of successful predictions by each approach for each change type as bars below the corresponding category: T5cr (blue bar), CodeReviewer (green).
Figure 3: Task complexity for correct and wrong predictions
