An Empirical Study on Code Review Activity Prediction and Its Impact in Practice
Doriane Olewicki, Sarra Habchi, Bram Adams
TL;DR
This study tackles the problem of lengthy and uneven code-review processes by predicting which files in a patch will require comments, require revisions, or constitute hot-spots. It evaluates two text-embedding families (BoW and encoder-based LLMs) and a suite of review-process features across five large datasets (three open-source and two closed-source), showing that combining content embeddings with process features consistently improves prediction performance, with median F1-scores reaching 40–62% and gains of up to 9% over state-of-the-art baselines. A key finding is that hot-spot-based file ordering, which prioritizes files likely to need review activity, directs reviewer attention toward the critical parts of a patch and increases both the number and the targeting quality of comments, though computation time for LLM embeddings is a practical bottleneck. The work provides practical guidance for deploying lightweight, feature-based representations and combining signals to enhance review efficiency, and it validates the potential of hot-spot prediction to accelerate software quality assurance in real-world settings.
Abstract
During code reviews, an essential step in software quality assurance, reviewers have the difficult task of understanding and evaluating code changes to validate their quality and prevent the introduction of faults into the codebase. This is a tedious process in which the effort needed depends heavily on the submitted code and on the author's and the reviewer's experience, leading to median wait times for review feedback of 15-64 hours. Through an initial user study carried out with 29 experts, we found that re-ordering the files changed by a patch within the review environment has the potential to improve review quality: more comments are written (+23%), and participants' file-level hot-spot precision and recall increase to 53% (+13%) and 28% (+8%), respectively, compared to the alphanumeric ordering. Hence, this paper aims to help code reviewers by predicting which files in a submitted patch (1) need to be commented, (2) need to be revised, or (3) are hot-spots (commented or revised). For these prediction tasks, we evaluate two types of text embeddings (i.e., Bag-of-Words and Large Language Model encodings) and review process features (i.e., code size-based and history-based features). Our empirical study on three open-source and two industrial datasets shows that combining the code embedding and review process features leads to better results than the state-of-the-art approach. For all tasks, F1-scores (median of 40-62%) are significantly better than the state-of-the-art (from +1 to +9%).
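To make the pipeline described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a whitespace-tokenized Bag-of-Words embedding, a plain concatenation of that embedding with review process features, and hot-spot scores supplied by some already-trained classifier. All function names, the vocabulary, and the example values are hypothetical.

```python
from collections import Counter


def bow_vector(diff_text, vocabulary):
    """Count how often each vocabulary token occurs in the diff text.

    Assumption: tokens are obtained by naive whitespace splitting.
    """
    counts = Counter(diff_text.split())
    return [counts[token] for token in vocabulary]


def file_features(diff_text, vocabulary, process_features):
    """Concatenate the content embedding with review process features.

    `process_features` stands in for size- and history-based values,
    e.g. (lines_changed, past_revisions); these names are illustrative.
    """
    return bow_vector(diff_text, vocabulary) + list(process_features)


def order_by_hotspot(scores):
    """Return file paths sorted by descending predicted hot-spot score.

    Ties fall back to alphanumeric path order, the baseline ordering.
    """
    return sorted(scores, key=lambda path: (-scores[path], path))


def precision_recall(flagged, actual):
    """File-level precision and recall of flagged files against true hot-spots."""
    flagged, actual = set(flagged), set(actual)
    hits = len(flagged & actual)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical scores, as if produced by a classifier trained on file_features().
    predicted = {"a.py": 0.2, "b.py": 0.9, "c.py": 0.2}
    print(order_by_hotspot(predicted))
    print(precision_recall(["a.py", "b.py"], ["b.py", "c.py"]))
```

In this sketch the classifier itself is deliberately left out: any model that maps the concatenated feature vector to a score could feed `order_by_hotspot`, and `precision_recall` mirrors the file-level hot-spot metrics reported for the user study.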
