Harnessing Large Language Models for Curated Code Reviews

Oussama Ben Sghaier, Martin Weyssow, Houari Sahraoui

TL;DR

This paper tackles the problem that noisy code-review data limits AI-assisted commenting and code refinement. It introduces an explicit evaluation framework for code-review comments and an LLM-driven curation pipeline that yields CuRev, a higher-quality, fully civil dataset with improved clarity and conciseness. Experiments show that models trained on CuRev outperform those trained on the original data in automated comment generation (BLEU gains) and code refinement (CodeBLEU/Exact Match gains). The work demonstrates the importance of data quality in software maintenance tasks and provides a practical, reproducible pipeline for future research.

Abstract

In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. Well-crafted comments not only streamline the code review itself but are also essential for subsequent tasks like code refinement, where the code is modified to satisfy the input review comment. Although various AI-based approaches have aimed to automate comment generation, their effectiveness remains limited by the quality of the training data. Existing code review datasets are often noisy and unrefined, which limits the learning potential of AI models and hinders the automation process. To address these challenges, we propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset. We begin by establishing an evaluation framework, incorporating specific criteria and categories to empirically study the initial quality of the dataset. Using a large language model (LLM)-driven approach, we then apply our curation pipeline to refine the dataset. A comparative analysis of the newly curated dataset, based on the same evaluation framework, demonstrates substantial improvements in the clarity and conciseness of the comments. Additionally, we assess the impact of the curated dataset on automating downstream tasks, specifically comment generation and code refinement. Our findings show that the curated dataset leads to enhanced model performance in generating more accurate comments. Curated comments are also more useful as they lead to more accurate code refinement.

Paper Structure

This paper contains 24 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Overview of our methodology. We use a large code review dataset whose samples comprise pre-commit and post-commit code along with review comments. For each sample, we use LLM-as-a-Judge with Llama-3.1-70B to generate a reformulated review comment, a categorization of the review, and a score for the original review comment. Next, we use the reformulated review comments to create our curated dataset, while filtering out irrelevant samples. Finally, we compare the effectiveness of LLMs fine-tuned on the original and curated datasets on two downstream tasks: comment generation and code refinement.
  • Figure 2: Overview of our evaluation framework.
  • Figure 3: Distribution of the different categories across the original dataset.
  • Figure 4: Distribution of scoring criteria on the original dataset.
  • Figure 5: Distribution of the clarity and conciseness scoring criteria on the curated dataset.
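
The curation flow described in the Figure 1 caption (judge each sample, keep the reformulated comment, drop irrelevant samples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ReviewSample`, `Judgment`, `curate`, and `toy_judge`, the integer `relevance` score, and the `min_relevance` threshold are all hypothetical. In the paper the judge is Llama-3.1-70B; here it is replaced by a trivial rule-based stand-in so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ReviewSample:
    old_code: str  # pre-commit code
    new_code: str  # post-commit code
    comment: str   # review comment


@dataclass
class Judgment:
    reformulated: str  # rewritten review comment
    category: str      # review category label (hypothetical values)
    relevance: int     # hypothetical score for the original comment


def toy_judge(sample: ReviewSample) -> Judgment:
    """Stand-in for the LLM judge: NOT the paper's prompt or model."""
    text = " ".join(sample.comment.split())  # collapse stray whitespace
    # Assumption: empty or bare approval comments count as irrelevant.
    relevant = text.lower() not in {"", "lgtm"}
    return Judgment(reformulated=text, category="other", relevance=int(relevant))


def curate(
    samples: Iterable[ReviewSample],
    judge: Callable[[ReviewSample], Judgment],
    min_relevance: int = 1,  # hypothetical filtering threshold
) -> List[ReviewSample]:
    """Keep relevant samples and swap in the reformulated comment."""
    curated = []
    for sample in samples:
        verdict = judge(sample)
        if verdict.relevance < min_relevance:
            continue  # filter out irrelevant samples
        curated.append(
            ReviewSample(sample.old_code, sample.new_code, verdict.reformulated)
        )
    return curated
```

Passing the judge as a callable keeps the filtering logic independent of how the judgment is produced, so a real LLM-backed judge could be substituted for `toy_judge` without changing `curate`.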