Code Review Automation using Retrieval Augmented Generation
Qianru Meng, Xiao Zhang, Zhaochen Ren, Joost Visser
TL;DR
This work addresses the labor-intensive task of code review by introducing RARe, a Retrieval-Augmented Reviewer that couples a dense retriever with a large language model to inject external, domain-relevant knowledge into automatically generated reviews. By retrieving relevant past reviews and adding them to the generation prompt, RARe achieves state-of-the-art performance on two benchmark datasets (BLEU-4 scores of 12.32 and 12.96), and its results are further supported by a human evaluation and an interpretability case study. The authors systematically compare retrievers, generators, and prompting strategies, showing that retrieval quality, especially for the top-1 result, substantially improves performance and reduces overly general or off-point outputs. They also discuss implications, limitations, and threats to validity, and release their code to encourage adoption and further research on automated software engineering tasks.
Abstract
Code review is essential for maintaining software quality but is labor-intensive. Automated code review generation offers a promising solution to this challenge. Both deep learning-based generative techniques and retrieval-based methods have demonstrated strong performance on this task. Despite these advances, generated reviews can still be off-point or overly general. To address these issues, we introduce Retrieval-Augmented Reviewer (RARe), which leverages Retrieval-Augmented Generation (RAG) to combine retrieval-based and generative methods, explicitly incorporating external domain knowledge into the code review process. RARe uses a dense retriever to select the most relevant reviews from the codebase. These reviews then enrich the input to a neural generator, which draws on the contextual learning capacity of large language models (LLMs) to produce the final review. RARe outperforms state-of-the-art methods on two benchmark datasets, achieving BLEU-4 scores of 12.32 and 12.96, respectively. Its effectiveness is further validated through a detailed human evaluation and a case study using an interpretability tool, demonstrating its practical utility and reliability.
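The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The bag-of-tokens `embed` function is a toy stand-in for a dense encoder, and the function names, prompt wording, and two-item corpus are all hypothetical. The sketch stops at prompt construction and assumes the prompt would then be passed to an LLM generator.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-tokens count vector.
    A real system would use a learned dense embedding model instead."""
    return Counter(re.findall(r"[A-Za-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_code, corpus, k=1):
    """Return the k (code, review) pairs whose code is most similar to the query."""
    q = embed(query_code)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query_code, retrieved):
    """Prepend retrieved (code, review) pairs to the change under review.
    The prompt wording is an assumption made for illustration."""
    parts = ["You are a code reviewer. Similar past changes and their reviews:"]
    for i, (code, review) in enumerate(retrieved, start=1):
        parts.append(f"Example {i} code:\n{code}\nExample {i} review:\n{review}")
    parts.append(f"Now review this change:\n{query_code}\nReview:")
    return "\n\n".join(parts)


# Hypothetical corpus of past (code, review) pairs.
corpus = [
    ("def read(path): f = open(path); return f.read()",
     "Use a with-statement so the file is closed."),
    ("for i in range(len(items)): print(items[i])",
     "Iterate over items directly instead of indexing."),
]

query = "def load(path): f = open(path); data = f.read(); return data"
top = retrieve(query, corpus, k=1)   # top-1 retrieval
prompt = build_prompt(query, top)    # enriched input for the generator
```

Setting `k=1` mirrors the TL;DR's emphasis on top-1 retrieval results. Raising `k` adds more retrieved reviews to the prompt at the cost of a longer input.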
