
Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation

Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam

TL;DR

Code review data often contain noisy, non-actionable comments that hinder neural models from producing useful feedback. The authors propose an LLM-based semantic cleaning pipeline to classify and filter valid versus noisy comments, achieving precision between $66\%$ and $85\%$ in identifying valid comments on CodeReviewer. When models are trained on cleaned data, BLEU-4 scores improve by about $12$–$13\%$, and generated comments show substantial gains in informativeness (up to $24\%$) and relevance (about $11\%$). This work demonstrates that dataset quality is a critical factor for automated code review, offering a scalable approach to improve practical utility and encouraging further research into data-cleaning strategies for code-related NLP tasks.

Abstract

Code review is an important practice in software development, yet it is time-consuming and requires substantial effort. While open-source datasets have been used to train neural models for automating code review tasks, including review comment generation, these datasets contain a significant amount of noisy comments (e.g., vague or non-actionable feedback) that persist despite cleaning methods using heuristics and machine learning approaches. Such remaining noise may lead models to generate low-quality review comments, yet removing it requires a complex semantic understanding of both code changes and natural language comments. In this paper, we investigate the impact of such noise on review comment generation and propose a novel approach using large language models (LLMs) to further clean these datasets. Based on an empirical study on a large-scale code review dataset, our LLM-based approach achieves 66-85% precision in detecting valid comments. State-of-the-art code review models fine-tuned on the predicted valid comments (cleaned models) generate review comments that are 13.0% - 12.4% more similar to valid human-written comments than those of the original models. We also find that the cleaned models generate more informative and relevant comments than the original models. Our findings underscore the critical impact of dataset quality on the performance of review comment generation. We advocate for further research into cleaning training data to enhance the practical utility and quality of automated code review.
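
The cleaning step described in the abstract can be pictured as a filter over (code change, comment) pairs, together with a precision check against a hand-labelled sample. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `ReviewExample`, `filter_valid`, `precision_valid`, and `toy_classifier` are hypothetical, and `toy_classifier` is a keyword heuristic standing in for the LLM classifier so that the example runs without any model access.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ReviewExample:
    """One training pair: a code change and its human-written review comment."""
    diff: str
    comment: str


def filter_valid(
    examples: Sequence[ReviewExample],
    is_valid: Callable[[ReviewExample], bool],
) -> List[ReviewExample]:
    """Keep only the examples that the classifier predicts to be valid."""
    return [ex for ex in examples if is_valid(ex)]


def precision_valid(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of comments predicted valid that are labelled valid."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    pred_pos = sum(1 for p in predicted if p)
    return true_pos / pred_pos if pred_pos else 0.0


def toy_classifier(ex: ReviewExample) -> bool:
    """Hypothetical stand-in for an LLM call: reject very short or stock vague comments."""
    vague = {"why?", "hmm", "same here", "nit"}
    text = ex.comment.strip().lower()
    return text not in vague and len(text.split()) >= 4
```

In practice `is_valid` would wrap a prompt to an LLM that sees both the code change and the comment; keeping it as a plain callable lets the filtering and evaluation logic be tested independently of the model.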

Paper Structure

This paper contains 25 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples of noisy (Top) and valid (Bottom) comments for automated review comment generation.
  • Figure 2: An overview of the pipeline of our study.
  • Figure 3: The prompt template for noisy classification using $\text{P}_{\textsc{Definition}}$ with context.
  • Figure 4: Distribution of information and relevance scores on tests across CodeReviewer models trained on different training sets.
  • Figure 5: Example comments generated by original and cleaned models with information (Info) and relevance (Rel) scores.