CodeJudge: Evaluating Code Generation with Large Language Models
Weixi Tong, Tianyi Zhang
TL;DR
CodeJudge is a test-free framework for evaluating the semantic correctness of LLM-generated code by guiding the evaluator model through "slow thinking" prompts. It combines an Analyze then Summarize pipeline, which yields a binary correctness verdict, with a Taxonomy-Guided Fault Localization score that captures how far the code deviates from user intent. Across multiple datasets and programming languages, CodeJudge achieves higher correlation with ground-truth semantic correctness and stronger binary accuracy than token-based, embedding-based, and prior LLM-based baselines, even when using open-source LLMs. The work highlights practical benefits for scalable code evaluation and analyzes prompt design and failure modes. It also releases code and data to support reproducibility and future improvements, while acknowledging limitations on challenging benchmarks and sensitivity to prompting.
Abstract
Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing "slow thinking" to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub at https://github.com/VichyTong/CodeJudge.
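To make the Analyze then Summarize idea concrete, the following is a minimal sketch of a two-step, test-free evaluation loop. It is not the authors' implementation: the prompt wording, the function name analyze_then_summarize, and the stand-in model fake_llm are all assumptions made for illustration, and a real evaluator would replace fake_llm with an actual LLM call.

```python
from typing import Callable


def analyze_then_summarize(task: str, code: str, llm: Callable[[str], str]) -> bool:
    """Two-step evaluation: free-form analysis first, then a binary verdict.

    The prompts below are illustrative assumptions, not the paper's prompts.
    """
    # Step 1 ("analyze"): ask the model to reason about the code step by step.
    analysis_prompt = (
        "Task description:\n" + task + "\n\n"
        "Code:\n" + code + "\n\n"
        "Explain step by step whether the code fulfils the task."
    )
    analysis = llm(analysis_prompt)

    # Step 2 ("summarize"): reduce the analysis to a single Yes/No decision.
    summary_prompt = (
        "Analysis:\n" + analysis + "\n\n"
        "Based on the analysis, is the code correct? Answer Yes or No."
    )
    verdict = llm(summary_prompt).strip().lower()
    return verdict.startswith("yes")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model, used only to run the sketch."""
    if "Answer Yes or No" in prompt:
        return "No" if "off-by-one" in prompt else "Yes"
    if "range(1, n)" in prompt:
        return "The loop has an off-by-one error: it stops at n - 1."
    return "The code matches the task description."


if __name__ == "__main__":
    task = "Return the sum of the integers from 1 to n inclusive."
    print(analyze_then_summarize(task, "sum(range(1, n + 1))", fake_llm))
    print(analyze_then_summarize(task, "sum(range(1, n))", fake_llm))
```

Separating the two calls lets the first response reason freely while the second is constrained to an easily parsed answer, which is one plausible reading of the "slow thinking" guidance described above.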
