CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects

Hanyang Guo, Xunjin Zheng, Zihan Liao, Hang Yu, Peng DI, Ziyin Zhang, Hong-Ning Dai

TL;DR

CodeFuse-CR-Bench tackles the realism gap in automated code review with a comprehensiveness-aware benchmark that preserves repository-level context for end-to-end CR evaluation in Python. Its evaluation framework has a rule-based part and a model-based part: precise location and defect matching metrics on one side, and a reward model trained on large-scale CR signals plus an LLM-as-a-judge for semantic quality on the other. The paper constructs 601 high-quality CR task instances from 70 projects through a five-step pipeline and shows that no single LLM dominates across all dimensions; Gemini 2.5 Pro achieves the strongest overall performance and near-oracle results under constrained context. These findings underscore the value of holistic, multi-dimensional evaluation for building capable and practical CR assistants, with direct implications for open-source software maintenance and LLM-driven code review tools.

Abstract

Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This setup fails to reflect the holistic, context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
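
To make the idea of pairing a rule-based location check with a model-based quality judgment concrete, here is a minimal Python sketch. It is not the paper's implementation: the `ReviewComment` type, the line-range overlap rule, the `quality_score` input (standing in for a reward-model or LLM-judge output normalized to [0, 1]), and the linear weighting are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    """Hypothetical review comment anchored to a file and an inclusive line range."""
    path: str
    start_line: int
    end_line: int
    text: str


def location_match(pred: ReviewComment, ref: ReviewComment) -> bool:
    """Rule-based check (assumed rule): same file and overlapping line ranges."""
    if pred.path != ref.path:
        return False
    return pred.start_line <= ref.end_line and ref.start_line <= pred.end_line


def combined_score(
    pred: ReviewComment,
    ref: ReviewComment,
    quality_score: float,
    location_weight: float = 0.5,
) -> float:
    """Blend a 0/1 location hit with a model-based quality score in [0, 1].

    The linear blend and the default weight are assumptions made for
    illustration; the source does not specify how the signals are aggregated.
    """
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must lie in [0, 1]")
    if not 0.0 <= location_weight <= 1.0:
        raise ValueError("location_weight must lie in [0, 1]")
    location_hit = 1.0 if location_match(pred, ref) else 0.0
    return location_weight * location_hit + (1.0 - location_weight) * quality_score


if __name__ == "__main__":
    reference = ReviewComment("pkg/io.py", 40, 48, "Close the file handle on error.")
    predicted = ReviewComment("pkg/io.py", 46, 52, "File handle leaks if parsing fails.")
    # quality_score would come from a judge model; 0.8 is a made-up value.
    print(combined_score(predicted, reference, quality_score=0.8))
```

Keeping the two signals as separate inputs mirrors the abstract's point that location accuracy and review quality are distinct dimensions; a single blended number is only one possible way to summarize them.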

Paper Structure

This paper contains 34 sections, 9 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: CR Process
  • Figure 2: The Overview of A Typical CR Task Instance
  • Figure 3: CodeFuse-CR-Bench Construction Pipeline
  • Figure 4: Evaluation Metric Framework
  • Figure 5: LLM-as-a-Judge Prompt
  • ...and 3 more figures