Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders
Peter C. Rigby, Seth Rogers, Sadruddin Saleem, Parth Suresh, Daniel Suskin, Patrick Riggs, Chandra Maddila, Nachiappan Nagappan
TL;DR
This study assesses three production-tested improvements to code reviewer recommendation at Meta: RevRecV2 improves accuracy and sharply reduces latency; RevRecWL attempts workload balancing by re-ranking candidates, trading some accuracy for lighter reviewer queues; Bystander RecRnd mitigates the bystander effect by explicitly assigning an individual reviewer when only a team is designated. Through three randomized A/B experiments, the authors quantify TopN accuracy, TimeInReview, TimeSpent, Latency, Clicks, and Workload. The results show substantial gains for RevRecV2 (Top3 accuracy $+14.19$ pp, $p<0.01$; a $14\times$ reduction in latency, $p<0.01$) but a notable accuracy cost for the workload-aware re-ranking in RevRecWL (Top1 $-4.90$ pp, $p<0.01$). The Bystander RecRnd approach speeds up reviews in the live experiment ($-11.6\%$ TimeInReview, $p<0.01$) with no guardrail regressions, supporting its deployment. The work also highlights a gap between backtesting and live experiments, underscoring the importance of production A/B tests for validating recommender effects and guiding deployment decisions.
Abstract
The code review team at Meta is continuously improving the code review process. To evaluate new recommenders, we conduct three A/B tests, which are a type of randomized controlled trial. Expt 1. We developed a new recommender based on features that had been used successfully in the literature and that could be calculated with low latency. In an A/B test on 82k diffs in Spring 2022, we found that the new recommender was more accurate and had lower latency. Expt 2. Reviewer workload is not evenly distributed, and our goal was to reduce the workload of top reviewers. We ran an A/B test of a workload-balanced recommender on 28k diff authors in Winter 2023. This A/B test led to mixed results. Expt 3. We suspected that the bystander effect might be slowing down reviews of diffs where only a team was assigned. We conducted an A/B test on 12.5k authors in Spring 2023 and found a large decrease in the time it took for diffs to be reviewed when a recommended individual was explicitly assigned. Our findings also suggest there can be a discrepancy between historical back-testing and A/B test experimental findings.
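For readers unfamiliar with the TopN accuracy metric reported for these experiments, the following minimal Python sketch shows one common way to compute it. It assumes a recommendation counts as a hit when at least one person who actually reviewed the diff appears among the first N recommended candidates; that definition, the function name `top_n_accuracy`, and the toy diff and reviewer names are illustrative assumptions, not taken from the paper.

```python
def top_n_accuracy(recommendations, actual_reviewers, n):
    """Fraction of diffs with at least one actual reviewer in the top-n list.

    recommendations: dict mapping a diff id to a ranked list of candidates.
    actual_reviewers: dict mapping a diff id to the set of people who reviewed it.
    Both structures are hypothetical stand-ins for real review logs.
    """
    if not recommendations:
        return 0.0
    hits = 0
    for diff_id, ranked in recommendations.items():
        reviewers = actual_reviewers.get(diff_id, set())
        # A hit: any actual reviewer appears among the first n candidates.
        if set(ranked[:n]) & reviewers:
            hits += 1
    return hits / len(recommendations)


# Toy data (invented for illustration only).
recs = {
    "D1": ["alice", "bob", "carol"],
    "D2": ["dave", "erin", "frank"],
    "D3": ["bob", "alice", "dave"],
    "D4": ["carol", "erin", "alice"],
}
actual = {
    "D1": {"bob"},
    "D2": {"zoe"},
    "D3": {"bob"},
    "D4": {"alice", "erin"},
}

top1 = top_n_accuracy(recs, actual, 1)
top3 = top_n_accuracy(recs, actual, 3)
print(f"Top1 = {top1:.2f}, Top3 = {top3:.2f}")
```

Under this definition, a difference in TopN accuracy between treatment and control arms, expressed in percentage points, is the kind of quantity the accuracy results summarize.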
