CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

Atharva Naik, Marcus Alenius, Daniel Fried, Carolyn Rose

TL;DR

CRScore addresses the inadequacy of reference-based metrics for code review evaluation with a reference-free, dimension-based approach grounded in code-change claims and smells. It generates exhaustive pseudo-references via a neuro-symbolic pipeline that combines LLMs and static analyzers, then rates reviews for conciseness, comprehensiveness, and relevance using semantic similarity against these pseudo-references. Empirical validation shows that CRScore aligns with human judgments more closely than traditional metrics (Spearman around 0.54 for relevance) and provides robust system rankings. The authors also release a large dataset of human judgments (≈2.9k scores). The method offers a scalable, open-source alternative for evaluating code-review generation systems and motivates further improvements in pseudo-reference generation and STS matching.

Abstract

The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff). Furthermore, code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. Thus, we develop CRScore - a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open source metrics (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.

Paper Structure (51 sections, 8 equations, 8 figures, 18 tables)

This paper contains 51 sections, 8 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Example diff with multiple valid reviews. The ground truth and model-generated reviews focus on different topics, like the performance of the added check and how likely it is to be triggered. However, a reference-based metric like the BLEU score assigns the model-generated review a low score of 0.0458.
  • Figure 2: Operationalization of CRScore: Our metric first generates pseudo-references for the diff (claims, implications, and issues). Then each pseudo-reference is embedded by a sentence transformer along with each review sentence, and the pairwise semantic textual similarity (STS) is computed. The high-similarity threshold $\tau$ is used to compute the Con and Comp metrics, whose harmonic mean yields the Rel score.
  • Figure 3: Supervised fine-tuning pipeline for training Magicoder-6.7B for claim generation. We create synthetic training data by using GPT-4 to generate claims for the code changes in the CodeReviewer validation set.
  • Figure 4: How semantic textual similarity (STS) is used to measure the coverage of pseudo-references by the review sentences. We compute pairwise semantic similarities between all pseudo-references and review sentences and apply a threshold. Comprehensiveness is the fraction of pseudo-references for which at least one review sentence has higher similarity than the threshold. Conciseness is the fraction of review sentences that have higher similarity than the threshold with any pseudo-reference.
  • Figure 5: Histogram of sentence similarity for 100K randomly sampled sentence pairs from the CodeReviewer test set. The scores are roughly normally distributed, which justifies using the 5-sigma rule to arrive at the high-similarity threshold of 0.85 used in metric computation.
  • ...and 3 more figures
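
The scoring described in the captions of Figures 2 and 4 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the pairwise STS values have already been computed by some sentence-embedding model, and the function name `crscore_from_similarities`, the matrix layout (rows are pseudo-references, columns are review sentences), and the choice to return 0.0 for empty inputs are all assumptions made here. Only the threshold of 0.85 and the definitions of Con, Comp, and Rel come from the captions.

```python
from typing import Dict, Sequence


def crscore_from_similarities(
    sims: Sequence[Sequence[float]], tau: float = 0.85
) -> Dict[str, float]:
    """Compute Con, Comp, and Rel from a precomputed similarity matrix.

    sims[i][j] is the semantic textual similarity between pseudo-reference i
    and review sentence j (hypothetical layout chosen for this sketch).
    """
    n_refs = len(sims)
    n_sents = len(sims[0]) if n_refs else 0
    # Assumption: score empty inputs as 0.0 rather than raising an error.
    if n_refs == 0 or n_sents == 0:
        return {"con": 0.0, "comp": 0.0, "rel": 0.0}

    # Comprehensiveness: fraction of pseudo-references matched by at least
    # one review sentence with similarity above the threshold.
    covered_refs = sum(1 for row in sims if any(s > tau for s in row))
    comp = covered_refs / n_refs

    # Conciseness: fraction of review sentences whose similarity to any
    # pseudo-reference is above the threshold.
    matched_sents = sum(
        1 for j in range(n_sents) if any(sims[i][j] > tau for i in range(n_refs))
    )
    con = matched_sents / n_sents

    # Relevance: harmonic mean of conciseness and comprehensiveness.
    rel = 2 * con * comp / (con + comp) if (con + comp) > 0 else 0.0
    return {"con": con, "comp": comp, "rel": rel}


# Toy example: 3 pseudo-references (rows) and 2 review sentences (columns).
example_sims = [
    [0.90, 0.10],
    [0.20, 0.30],
    [0.88, 0.40],
]
scores = crscore_from_similarities(example_sims)
print(scores)
```

In the toy matrix, two of the three pseudo-references are covered and one of the two review sentences matches something, so Comp is 2/3, Con is 1/2, and Rel is their harmonic mean, 4/7.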