Code Review Automation using Retrieval Augmented Generation

Qianru Meng, Xiao Zhang, Zhaochen Ren, Joost Visser

TL;DR

This work addresses automated code review generation, motivated by how labor-intensive manual review is, by introducing RARe, a Retrieval-Augmented Reviewer that couples a dense retriever with a large language model to inject external, domain-relevant knowledge into code reviews. By retrieving relevant past reviews and incorporating them into the generation prompt, RARe achieves state-of-the-art performance on two benchmark datasets (BLEU-4 of 12.32 and 12.96), a result further supported by a human evaluation and an interpretability case study. The authors systematically compare retrievers, generators, and prompting strategies, showing that retrieval quality, especially of the top-1 result, substantially improves performance and reduces overly general or off-point outputs. They also discuss implications, limitations, and threats to validity, and release their code to encourage adoption and further research in automated software engineering tasks.

Abstract

Code review is essential for maintaining software quality but is labor-intensive. Automated code review generation offers a promising solution to this challenge. Both deep learning-based generative techniques and retrieval-based methods have demonstrated strong performance in this task. However, despite these advancements, generated reviews can still be either off-point or overly general. To address these issues, we introduce Retrieval-Augmented Reviewer (RARe), which leverages Retrieval-Augmented Generation (RAG) to combine retrieval-based and generative methods, explicitly incorporating external domain knowledge into the code review process. RARe uses a dense retriever to select the most relevant reviews from the codebase, which then enrich the input for a neural generator, utilizing the contextual learning capacity of large language models (LLMs), to produce the final review. RARe outperforms state-of-the-art methods on two benchmark datasets, achieving BLEU-4 scores of 12.32 and 12.96, respectively. Its effectiveness is further validated through a detailed human evaluation and a case study using an interpretability tool, demonstrating its practical utility and reliability.
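
The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that embeddings for the code change and for past reviews have already been computed by some dense encoder, ranks past reviews by cosine similarity, and places the top-k reviews into a prompt for a generator. The function names (`cosine`, `retrieve_top_k`, `build_prompt`), the toy three-dimensional vectors, and the prompt wording are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    corpus: List[Tuple[Sequence[float], str]],
    k: int = 1,
) -> List[str]:
    """Return the k past reviews whose embeddings are most similar to the query.

    `corpus` holds (embedding, review_text) pairs; the embeddings are assumed
    to come from the same encoder as `query_vec`.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [review for _, review in ranked[:k]]


def build_prompt(code_change: str, retrieved: List[str]) -> str:
    """Assemble a generator prompt that includes the retrieved reviews as context.

    The template below is an assumption, not the prompt used in the paper.
    """
    examples = "\n".join(f"- {review}" for review in retrieved)
    return (
        "Similar past reviews:\n"
        f"{examples}\n\n"
        "Code change:\n"
        f"{code_change}\n\n"
        "Review:"
    )


# Toy data: hand-written vectors stand in for real dense embeddings.
toy_corpus = [
    ([1.0, 0.0, 0.0], "Consider adding a null check here."),
    ([0.0, 1.0, 0.0], "Rename this variable for clarity."),
    ([0.7, 0.7, 0.0], "This loop can be simplified."),
]
toy_query = [0.9, 0.1, 0.0]

top_reviews = retrieve_top_k(toy_query, toy_corpus, k=1)
prompt = build_prompt("+ return user.name", top_reviews)
```

In a real system the resulting `prompt` would be passed to a language model, and the choice of `k` matters: the summary above notes that the quality of the top-1 retrieved review has a large effect on the generated output.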

Paper Structure

This paper contains 29 sections, 8 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: The overall architecture of RARe. Different colors distinguish the retrieval, generation, and RAG processes. Within the review of the target code, text related to generation is marked in blue, while text related to retrieval is marked in green.
  • Figure 2: An example from the Tuf. dataset and a saliency heatmap comparison for fine-tuned Llama 3.1 with and without retrieval augmentation. The horizontal rows of the tables display the models' output. Different input components are differentiated by color, and higher attention scores are highlighted in red within the tables.