CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang

TL;DR

CodeCriticBench introduces a holistic, code-focused benchmark to evaluate LLMs' critique abilities across code generation and code QA. By combining basic correctness judgments with advanced, fine-grained evaluation checklists and calibrating scores against human judgments, it enables robust cross-model comparisons over 4,300 samples with varying difficulty. The study demonstrates scaling trends, task-dependent performance, and the value of CoT-like, fine-grained evaluation in aligning model critiques with human standards. This benchmark has practical implications for advancing code-review and coding-assistance tools in real-world software development.

Abstract

The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, since it allows them to provide useful suggestions (e.g., detailed analysis and constructive feedback). How to evaluate this capacity has therefore drawn considerable attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) They focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), with relatively easy queries (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP). (2) They lack comprehensive evaluation along different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, with carefully designed fine-grained evaluation checklists for the advanced setting. Finally, we report extensive experiments on existing LLMs, which show the effectiveness of CodeCriticBench.
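As a rough illustration of the basic critique evaluation, the sketch below computes an accuracy (ACC) score. It assumes that each model critique reduces to a binary verdict (the solution is correct or incorrect) that is compared against a reference label; the function name and data layout are hypothetical and are not taken from the benchmark's code.

```python
from typing import Sequence


def basic_critique_accuracy(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> float:
    """Fraction of samples where the model's verdict matches the reference.

    Assumption: True means "the candidate solution is judged correct".
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one sample is required")
    # Count verdicts that agree with the reference labels.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical verdicts for four samples; three of them match the labels.
model_verdicts = [True, False, True, True]
reference_labels = [True, False, False, True]
acc = basic_critique_accuracy(model_verdicts, reference_labels)
```

The advanced critique evaluation scores responses against fine-grained checklists and would need a different metric, so it is not covered by this sketch.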

Paper Structure

This paper contains 25 sections, 3 equations, 29 figures, 29 tables.

Figures (29)

  • Figure 1: Illustration of the Basic Critique Evaluation and Advanced Critique Evaluation.
  • Figure 2: Illustration of data collection process.
  • Figure 3: Scaling law on basic critique evaluation (ACC) across models. "*" indicates an estimated parameter size.
  • Figure 4: Comparison across different models on "Code QA" (Basic Critique Evaluation).
  • Figure 5: Model performance (ACC) on different difficulty levels (Basic Critique Evaluation).
  • ...and 24 more figures