CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Yuwei Zhao; Ziyang Luo; Yuchen Tian; Hongzhan Lin; Weixiang Yan; Annan Li; Jing Ma

TL;DR

Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.

Abstract

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our code and benchmark are available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
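To make the judging setup concrete, the sketch below scores a model's predicted verdicts against reference verdicts using accuracy and macro F1. It is an illustration only, not the benchmark's official scorer: the verdict names in `VERDICTS` and the helper functions `accuracy` and `macro_f1` are assumptions introduced here, and the actual label set and metric definitions should be taken from the paper and its repository.

```python
# Hypothetical verdict labels, chosen for illustration only.
# The benchmark's actual label set may differ.
VERDICTS = [
    "Accepted",
    "Compile Error",
    "Runtime Error",
    "Wrong Answer",
    "Time Limit Exceeded",
]


def accuracy(gold, pred):
    """Fraction of solutions whose predicted verdict matches the reference."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=VERDICTS):
    """Unweighted mean of per-label F1 scores.

    Labels that appear in neither gold nor pred are skipped so they
    do not distort the average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        if tp + fp + fn == 0:
            continue  # label unused in this sample
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: four judged solutions (made-up data, not benchmark results).
gold = ["Accepted", "Wrong Answer", "Accepted", "Compile Error"]
pred = ["Accepted", "Accepted", "Accepted", "Compile Error"]
print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro F1 = {macro_f1(gold, pred):.2f}")
```

Macro averaging weights every verdict class equally, so a judge that labels every solution as correct scores poorly even when most solutions in the sample happen to be correct.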

Paper Structure (42 sections, 4 equations, 13 figures, 7 tables)

Introduction
Related Work
CodeJudge-Eval
Overview
Dataset Construction
Data Source
Code Generation
Fine-grained Verdict Construction
Design of Labels
Data Filtering
Filter Problems by Test Case
Filter Solution Codes by Verdict
Experiment
Experimental Setup
Evaluated Methods
...and 27 more sections

Figures (13)

Figure 1: Comparing the code generation task with the code judging task, we observe that a model's ability to generate correct code does not necessarily imply that it can accurately judge other solutions to the same problem.
Figure 2: An overview of our pipeline for constructing the CodeJudge-Eval benchmark.
Figure 3: A stacked histogram of the number of test cases in the filtered problems. Different filtering thresholds are applied depending on problem difficulty.
Figure 4: Effect of scaling the number of model parameters on CJ-Eval Easy.
Figure 5: Comparative few-shot F1 scores of different models on CJ-Eval Easy.
...and 8 more figures
