CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
Atharva Naik, Marcus Alenius, Daniel Fried, Carolyn Rose
TL;DR
CRScore addresses the inadequacy of reference-based metrics for code review evaluation with a reference-free, dimension-based approach grounded in claims about the code change and in code smells. It generates exhaustive pseudo-references with a neuro-symbolic pipeline that combines LLMs and static analyzers, then rates each review for conciseness, comprehensiveness, and relevance by its semantic similarity to those pseudo-references. Empirical validation shows that CRScore aligns with human judgments more closely than traditional metrics (Spearman around 0.54 for relevance) and yields robust system rankings. The authors also release a dataset of human judgments (≈2.9k scores). The method offers a scalable, open-source alternative for evaluating code-review generation systems and motivates further work on pseudo-reference generation and semantic textual similarity (STS) matching. Overall, CRScore is a step toward reliable, automatic assessment of code review quality in practical settings.
Abstract
The task of automated code review has recently gained considerable attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff). Furthermore, code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. Thus, we develop CRScore, a reference-free metric that measures dimensions of review quality such as conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open source metrics (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
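To make the scoring idea concrete, here is a minimal sketch of how a review could be rated against pseudo-references. It is one plausible reading, not the authors' implementation. Several parts are assumptions not stated in the text above: the precision-like, recall-like, and harmonic-mean formulation of the three dimensions, the 0.3 threshold, and the names `toy_similarity` and `score_review`. A token-overlap (Jaccard) similarity stands in for a real semantic textual similarity model so that the example runs with the standard library alone.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a sentence embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def toy_similarity(a, b):
    """Jaccard token overlap. ASSUMPTION: replaces a learned STS model."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score_review(review_sentences, pseudo_refs, threshold=0.3, sim=toy_similarity):
    """Score a review against pseudo-references (claims and smells).

    ASSUMED formulation:
      conciseness       = share of review sentences matching some pseudo-reference
      comprehensiveness = share of pseudo-references matched by some review sentence
      relevance         = harmonic mean of the two
    The threshold value is hypothetical.
    """
    if not review_sentences or not pseudo_refs:
        return {"conciseness": 0.0, "comprehensiveness": 0.0, "relevance": 0.0}

    # Pairwise similarity: rows are review sentences, columns are pseudo-references.
    matrix = [[sim(s, p) for p in pseudo_refs] for s in review_sentences]

    matched_sentences = sum(1 for row in matrix if max(row) >= threshold)
    matched_refs = sum(
        1
        for j in range(len(pseudo_refs))
        if max(row[j] for row in matrix) >= threshold
    )

    conciseness = matched_sentences / len(review_sentences)
    comprehensiveness = matched_refs / len(pseudo_refs)
    total = conciseness + comprehensiveness
    relevance = 0.0 if total == 0 else 2 * conciseness * comprehensiveness / total
    return {
        "conciseness": conciseness,
        "comprehensiveness": comprehensiveness,
        "relevance": relevance,
    }


# Made-up pseudo-references and review, for illustration only.
pseudo_refs = [
    "the function parse_config now returns None on missing file",
    "unused variable tmp in parse_config",
    "the timeout default changed from 30 to 60",
]
review = [
    "parse_config returns None on missing file, callers should check it",
    "nice work",
]
scores = score_review(review, pseudo_refs)
print(scores)
```

In this toy case, one of the two review sentences matches a pseudo-reference and one of the three pseudo-references is covered, so the review scores higher on conciseness than on comprehensiveness. Because no human-written reference is involved, any review that addresses the detected claims or smells can score well, which is the point of a reference-free design for a one-to-many task.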
