Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders

Peter C. Rigby; Seth Rogers; Sadruddin Saleem; Parth Suresh; Daniel Suskin; Patrick Riggs; Chandra Maddila; Nachiappan Nagappan

TL;DR

This study assesses three production-tested improvements to code reviewer recommendation at Meta. RevRecV2 improves accuracy and sharply reduces latency. RevRecWL attempts workload balancing by re-ranking candidates, trading some accuracy for lighter reviewer queues. Bystander RecRnd mitigates the bystander effect by explicitly assigning an individual reviewer when only a team is designated. Through three randomized A/B experiments, the authors quantify TopN accuracy, TimeInReview, TimeSpent, Latency, Clicks, and Workload. RevRecV2 shows substantial gains (Top3 accuracy $+14.19$ pp, $p<0.01$; a $14\times$ latency reduction, $p<0.01$), whereas workload-aware re-ranking in RevRecWL carries a notable accuracy cost (Top1 $-4.90$ pp, $p<0.01$). Bystander RecRnd speeds up reviews ($-11.6\%$ TimeInReview, $p<0.01$) with no guardrail regressions, supporting its deployment. The work also highlights a gap between backtesting and live experiments, underscoring the importance of production A/B tests for validating recommender effects and guiding practical deployment decisions.

Abstract

The code review team at Meta is continuously improving the code review process. To evaluate new recommenders, we conduct three A/B tests, a type of randomized controlled trial. Expt 1. We developed a new recommender based on features that had been used successfully in the literature and that could be calculated with low latency. In an A/B test on 82k diffs in Spring 2022, we found that the new recommender was more accurate and had lower latency. Expt 2. Reviewer workload is not evenly distributed, and our goal was to reduce the workload of top reviewers. We ran an A/B test on 28k diff authors in Winter 2023 on a workload-balanced recommender, which led to mixed results. Expt 3. We suspected the bystander effect might be slowing down reviews of diffs where only a team was assigned. We conducted an A/B test on 12.5k authors in Spring 2023 and found a large decrease in the time it took for diffs to be reviewed when a recommended individual was explicitly assigned. Our findings also suggest there can be a discrepancy between historical back-testing and A/B test experimental findings.
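The TopN accuracy figures reported above are differences between experiment arms, expressed in percentage points. As a minimal sketch of how such a metric could be computed, assuming TopN accuracy means "at least one actual reviewer appears among the first N recommended candidates", the following uses hypothetical helper names (`top_n_accuracy`, `percentage_point_delta`) and made-up toy data. It is not the authors' implementation and omits the significance testing used in the paper.

```python
from typing import Sequence, Set


def top_n_accuracy(
    recommendations: Sequence[Sequence[str]],
    actual_reviewers: Sequence[Set[str]],
    n: int,
) -> float:
    """Fraction of diffs whose top-n recommendations include an actual reviewer.

    Assumed definition: a diff counts as a hit if any of its first n ranked
    candidates is in the set of people who actually reviewed it.
    """
    if len(recommendations) != len(actual_reviewers):
        raise ValueError("one recommendation list is needed per diff")
    if not recommendations:
        raise ValueError("at least one diff is required")
    hits = sum(
        1
        for ranked, actual in zip(recommendations, actual_reviewers)
        if any(candidate in actual for candidate in ranked[:n])
    )
    return hits / len(recommendations)


def percentage_point_delta(control: float, treatment: float) -> float:
    """Difference between two proportions, expressed in percentage points."""
    return (treatment - control) * 100.0


# Hypothetical toy data: each arm of the A/B test sees its own set of diffs.
control_recs = [["ann", "bob", "cy"], ["dee", "eli", "fay"],
                ["gus", "hal", "ivy"], ["jo", "kit", "lee"]]
control_actual = [{"bob"}, {"fay"}, {"ivy"}, {"lee"}]

treatment_recs = [["max", "ned", "oz"], ["pat", "quin", "rae"],
                  ["sam", "tia", "uma"], ["val", "wes", "xan"]]
treatment_actual = [{"max"}, {"pat"}, {"uma"}, {"val"}]

control_top1 = top_n_accuracy(control_recs, control_actual, n=1)
treatment_top1 = top_n_accuracy(treatment_recs, treatment_actual, n=1)
delta_pp = percentage_point_delta(control_top1, treatment_top1)
```

Reporting the change in percentage points rather than as a relative percentage keeps the effect size independent of the control arm's baseline accuracy.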

Paper Structure (25 sections, 2 equations, 1 figure, 10 tables)

Introduction
Background and Code Review Process
Code review process
Reviewer Recommenders at Meta
Literature, Features, and Recommender Design
Design and Feature Importance for RevRecV2
Design for RevRecWL
Design for RecBystander
Experimental Method and Outcome Metrics
Goal and guardrail Outcomes Metrics
Expt. 1: Accuracy and Latency
Expt. 2: Balancing Reviewer Workload
Historical Backtest Results
Results for Expt 2. RevRecWL in Production
The Bystander Effect
...and 10 more sections

Figures (1)

Figure 1: A code change, i.e. diff, under review at Meta. The recommended candidates for the review are clickable by the author and ordered left to right with the top ranked candidate on the left (see the red box in the figure). We have anonymized the diff and used mock names.
