Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

TL;DR

ContextCRBench introduces a large-scale, context-rich benchmark for fine-grained code review, addressing three key weaknesses of prior benchmarks: missing semantic context, data quality issues, and coarse evaluation granularity. The paper describes a three-module data pipeline (raw data crawling, comprehensive context extraction, and multi-stage filtering) that produces 67,910 high-quality, context-enriched samples across nine languages and supports three tasks: hunk-level quality estimation, line-level defect localization, and line-level review comment generation. It evaluates eight LLMs, showing that textual context (issue/PR descriptions) generally boosts performance more than code context, while current models still fall short of human-level review. The authors demonstrate practical impact by deploying ContextCRBench at ByteDance, achieving a 61.98% relative improvement in a self-evolving code review system, underscoring the benchmark’s industrial value and guiding future research toward better semantic understanding in AI-assisted code review.

Abstract

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. (1) Lack of semantic context: most benchmarks provide only code diffs without textual information such as issue descriptions, which are crucial for understanding developer intent. (2) Data quality issues: without rigorous validation, many samples are noisy (e.g., reviews on outdated or irrelevant code), reducing evaluation reliability. (3) Coarse granularity: most benchmarks operate at the file or commit level, overlooking the fine-grained, line-level reasoning essential for precise review. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained LLM evaluation in code review. Our construction pipeline comprises: (1) Raw Data Crawling, collecting 153.7K issues and pull requests from top-tier repositories; (2) Comprehensive Context Extraction, linking issue-PR pairs for textual context and extracting the full surrounding function or class for code context; and (3) Multi-stage Data Filtering, combining rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, resulting in 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: hunk-level quality assessment, line-level defect localization, and line-level comment generation. Evaluating eight leading LLMs (four closed-source and four open-source) reveals that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility. https://github.com/kinesiatricssxilm14/ContextCRBench.
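To make the abstract's description more concrete, the following is a minimal sketch of what one context-enriched entry and a rule-based filtering step might look like. It is an illustration only: the class `ReviewSample`, its field names, the helper functions, and the specific filtering rules are all assumptions introduced here, not the benchmark's actual schema or the authors' pipeline. The LLM-based validation stage is not sketched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReviewSample:
    """Hypothetical context-enriched entry; all field names are assumptions."""
    issue_text: str                      # textual context: linked issue description
    pr_text: str                         # textual context: pull request description
    enclosing_code: str                  # code context: surrounding function or class
    diff_hunk: List[str]                 # lines of the diff hunk under review
    comment_line: Optional[int] = None   # 0-based index into diff_hunk, None if no comment
    comment_text: str = ""               # reviewer comment, empty if none
    comment_is_outdated: bool = False    # assumed flag: commented code changed later


def passes_rule_filter(sample: ReviewSample) -> bool:
    """Illustrative rule-based checks for outdated or malformed samples."""
    if sample.comment_is_outdated:
        return False                     # review refers to code that no longer exists
    if not sample.diff_hunk:
        return False                     # malformed: nothing to review
    if sample.comment_line is not None:
        if not 0 <= sample.comment_line < len(sample.diff_hunk):
            return False                 # malformed: comment points outside the hunk
    return True


def task_views(sample: ReviewSample) -> dict:
    """Derive one assumed label per evaluation scenario from a single entry."""
    return {
        # hunk-level: does this hunk warrant a review comment at all?
        "hunk_needs_review": sample.comment_line is not None,
        # line-level localization: which line did the reviewer point at?
        "defect_line": sample.comment_line,
        # line-level generation: reference comment to compare against
        "reference_comment": sample.comment_text or None,
    }
```

In this sketch a single entry feeds all three evaluation scenarios, which is one plausible reading of how hunk-level and line-level tasks can share the same underlying data.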

Paper Structure

This paper contains 39 sections, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Examples of diff hunks illustrating the challenges of current code review benchmarks. (a) illustrates a high-quality entry where red boxes highlight the issue (problem), the PR (solution), and the corresponding diff hunk (the code change). The starred line of code marks the actual location of the review comment.
  • Figure 2: The automated code review workflow and our three evaluation tasks.
  • Figure 3: The pipeline of ContextCRBench construction. It consists of three main modules: a raw data crawling module for collecting large-scale issues and PRs, a comprehensive context extraction module for constructing rich textual and code context, and a multi-stage data filtering module for removing noisy data and producing the final high-quality benchmark.
  • Figure 4: A case study demonstrating the positive impact of textual context.
  • Figure 5: Performance across three code review tasks for different programming languages.