LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM

Yuxin Zhang, Yuxia Zhang, Zeyu Sun, Yanjie Jiang, Hui Liu

TL;DR

Code review is a bottleneck in modern software development because of the scale of code changes and the expert knowledge it requires. LAURA is a three-component framework (context augmentation, review exemplar retrieval, and systematic guidance) that augments LLMs for code review generation, demonstrated on ChatGPT-4o and DeepSeek v3. The approach yields substantial improvements over baselines and ablated variants, validated by both LLM-based and human evaluations, and includes a high-quality, retrieval-enabled dataset. The results indicate that enriching inputs with code-change context and structured prompts can significantly improve the usefulness and quality of automated code reviews, with potential for broad impact in software engineering workflows.

Abstract

Code review is critical for ensuring software quality and maintainability. With the rapid growth in software scale and complexity, code review has become a bottleneck in the development process because it is time-consuming and knowledge-intensive, and because experienced developers willing to review code are in short supply. Several approaches have been proposed for automatically generating code reviews based on retrieval, neural machine translation, pre-trained models, or large language models (LLMs). These approaches mainly leverage historical code changes and review comments. However, much information that is crucial for code review, such as the context of code changes and prior review knowledge, has been overlooked. This paper proposes LAURA, an LLM-based, review knowledge-augmented, context-aware framework for code review generation. The framework integrates review exemplar retrieval, context augmentation, and systematic guidance to enhance the performance of ChatGPT-4o and DeepSeek v3 in generating code review comments. In addition, because existing datasets contain many low-quality reviews, we constructed a high-quality dataset. Experimental results show that for the two models, LAURA generates review comments that are either completely correct or at least helpful to developers in 42.2% and 40.4% of cases, respectively, significantly outperforming state-of-the-art baselines. Furthermore, our ablation studies demonstrate that all components of LAURA contribute positively to improving comment quality.
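To make the three components named in the abstract more concrete, the following is a minimal sketch of how exemplar retrieval, context augmentation, and guidance could be combined into a single prompt. It is not the authors' implementation: the `Exemplar` type, the token-overlap (Jaccard) similarity, the function names, and the prompt layout are all assumptions chosen for illustration, and the abstract does not specify which retriever or prompt format LAURA uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Exemplar:
    """Hypothetical record of a past code change and the review it received."""
    diff: str
    review: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a real retriever."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(diff: str, pool: list[Exemplar], k: int = 2) -> list[Exemplar]:
    """Return the k exemplars whose diffs are most similar to the new diff."""
    ranked = sorted(pool, key=lambda e: jaccard(diff, e.diff), reverse=True)
    return ranked[:k]


def build_prompt(diff: str, context: str, pool: list[Exemplar],
                 guidance: str, k: int = 2) -> str:
    """Assemble guidance, change context, exemplars, and the diff into one prompt."""
    parts = [f"Guidance:\n{guidance}", f"Context:\n{context}"]
    for i, ex in enumerate(retrieve(diff, pool, k), start=1):
        parts.append(f"Exemplar {i} diff:\n{ex.diff}\nExemplar {i} review:\n{ex.review}")
    parts.append(f"Code change to review:\n{diff}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    pool = [
        Exemplar("def add(a, b): return a - b", "Should this be a + b?"),
        Exemplar("import os", "Unused import."),
    ]
    print(build_prompt("def add(x, y): return x - y",
                       "PR title: fix add", pool,
                       "Focus on correctness.", k=1))
```

The resulting string would then be sent to an LLM of choice. In a realistic setting the token-overlap retriever would likely be replaced by a stronger similarity measure, and the context would come from the pull request and surrounding code rather than a hand-written string.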

Paper Structure

This paper contains 26 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the LAURA framework.
  • Figure 2: Prompts used for LAURA and direct generation.
  • Figure 3: A real example of human evaluation of code review comments generated by LAURA-GPT, LAURA-DS, and CodeReviewer.
  • Figure 4: Two real cases of how LAURA assists LLMs in generating useful review comments.