DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, Chun Zuo

TL;DR

This work challenges the reliance on text-similarity metrics for evaluating code review comment generation (CRCG), arguing that benchmark comments are often unreliable and misaligned with the goal of improving code quality. It introduces two contributions: DeepCRCEval, a dual human/LLM evaluation framework that scores comments across nine domain-specific criteria and enables explicit scoring and ranking; and LLM-Reviewer, a target-oriented, training-free baseline built on few-shot prompts. Empirical results show that traditional state-of-the-art CRCG approaches underperform LLM-Reviewer, while DeepCRCEval discriminates between comments more sharply and more comprehensively than text similarity, and substantially reduces time and cost when LLM evaluators are used. The study also provides case studies and publicly available materials, and advocates a shift toward evaluation practices that reflect the true objectives of code review rather than mere textual similarity, with implications for future CRCG development and assessment.

Abstract

Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects. This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. We also introduce an innovative and efficient baseline, LLM-Reviewer, which leverages the few-shot learning capabilities of LLMs for a target-oriented comparison. Our research highlights the limitations of text similarity metrics, finding that fewer than 10% of benchmark comments are of high quality for automation. In contrast, DeepCRCEval effectively distinguishes between high- and low-quality comments, proving to be a more reliable evaluation mechanism. Incorporating LLM evaluators into DeepCRCEval significantly boosts efficiency, reducing time and cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates the significant potential of focusing on the task's real targets in comment generation.
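
The idea of scoring comments against a fixed set of criteria and then ranking them can be illustrated with a minimal sketch. This is not the paper's implementation: the criterion names below are placeholders (the summary states only that there are nine criteria), and the numeric rating scale, the equal weighting of criteria, and the function names `comment_score` and `rank_comments` are all assumptions made for illustration.

```python
from statistics import mean

# Placeholder names only: the summary says there are nine criteria
# but does not list them here, so these identifiers are hypothetical.
CRITERIA = [f"criterion_{i}" for i in range(1, 10)]


def comment_score(ratings):
    """Overall score for one comment.

    ratings: list of dicts, one per evaluator (human or LLM),
    each mapping every criterion name to a numeric rating.
    Assumption: ratings are averaged across evaluators per criterion,
    then averaged across criteria with equal weight.
    """
    per_criterion = {c: mean(r[c] for r in ratings) for c in CRITERIA}
    return mean(per_criterion.values())


def rank_comments(rated):
    """Return comment ids ordered from highest to lowest overall score.

    rated: dict mapping comment id -> list of evaluator rating dicts.
    """
    return sorted(rated, key=lambda cid: comment_score(rated[cid]), reverse=True)


if __name__ == "__main__":
    rated = {
        "comment_a": [{c: 5 for c in CRITERIA}, {c: 3 for c in CRITERIA}],
        "comment_b": [{c: 2 for c in CRITERIA}],
    }
    for cid in rank_comments(rated):
        print(cid, comment_score(rated[cid]))
```

Unlike a text-similarity metric, a scheme of this shape needs no reference comment: each generated comment is rated directly, so an unreliable benchmark comment cannot distort the score.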

Paper Structure

This paper contains 57 sections, 2 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The overall workflow of learning a deep neural network (DNN) model or retriever to automate code review.
  • Figure 2: Overview of our study. * indicates frameworks or models we newly proposed.
  • Figure 3: 4-group Venn diagrams showing the overlap of suitable quality, category, tone, and context in comments.
  • Figure 4: The overall workflow of LLM-Reviewer.
  • Figure 5: User feedback ratings distribution for "Good", "Acceptable", and "Poor".
  • ...and 2 more figures