MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

TL;DR

The paper tackles the unreliability of LLM-based code evaluation in reasoning-heavy tasks by introducing MCTS-Judge, a test-time Monte Carlo Tree Search framework that yields System-2-like, multi-perspective reasoning trajectories. It couples a global-local node selection strategy with a fully LLM-driven simulated execution reward to guide trajectory search and produce reliable judgments. Across three benchmarks (BigCodeBench, APPS, and HumanEval-X) and five base LLMs, MCTS-Judge delivers substantial accuracy gains, surpassing larger reasoning models while using fewer tokens and no reference code. The work also reveals a test-time scaling law: increasing test-time resources (tree depth, rollouts, test cases) consistently improves performance, which supports the practical viability of test-time computation for LLM-based evaluation.

Abstract

The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
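The node-selection idea in the abstract can be illustrated with a small sketch. The `uct_score` function below is the standard UCT formula (mean reward plus an exploration bonus). The blending step in `select_child`, including the `weight` parameter and the `self_assessment` scores, is a hypothetical stand-in for the LLM self-assessment and is not taken from the paper; the abstract does not state how the two signals are combined.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits, self_assessment, weight=0.5, c=1.4):
    """Pick the child with the best blended score.

    ASSUMPTION: adding `weight * self_assessment[name]` to the UCT score is
    an illustrative way to mix global statistics (prior rollouts) with a
    local signal (a rating given the current trajectory). The paper's
    actual combination rule may differ.
    """
    def blended(child):
        global_part = uct_score(child["reward"], child["visits"], parent_visits, c)
        local_part = weight * self_assessment.get(child["name"], 0.0)
        return global_part + local_part
    return max(children, key=blended)["name"]

# Usage: two candidate evaluation perspectives with equal visit counts.
children = [
    {"name": "check_edge_cases", "reward": 3.0, "visits": 4},
    {"name": "check_io_format", "reward": 1.0, "visits": 4},
]
ratings = {"check_edge_cases": 0.0, "check_io_format": 1.0}  # hypothetical LLM ratings
print(select_child(children, parent_visits=8, self_assessment=ratings, weight=0.0))
print(select_child(children, parent_visits=8, self_assessment=ratings, weight=1.0))
```

With `weight=0.0` the choice is driven by rollout statistics alone. A larger `weight` lets the local rating override them, which is the global-versus-local trade-off the abstract describes.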

Paper Structure

This paper contains 22 sections, 1 equation, 7 figures, 12 tables.

Figures (7)

  • Figure 1: With test-time scaling, our MCTS-Judge method doubles the accuracy of DeepSeek-Coder-V2-16B-Instruct on the APPS benchmark, surpassing o1-series models and Qwen-QwQ-32B, while using $3 \times$ fewer tokens and a smaller model. The circle sizes indicate the relative sizes of the models.
  • Figure 2: MCTS-Judge generates reasoning trajectories with multi-dimensional evaluations using Monte Carlo Tree Search (MCTS). Each trajectory is iteratively constructed through selection, expansion, simulation, and backpropagation. Our node selection strategy combines LLM-driven self-assessment, based on historical actions in the current trajectory, with the Upper Confidence Bound for Trees (UCT) algorithm based on prior rollouts. This strategy effectively integrates global and local information, balancing the optimization of high-value regions in the search space with the refinement of the current trajectory. Moreover, we introduce a high-precision, unit-test-level reward mechanism, encouraging the LLM to perform line-by-line analysis. This simulated execution reward guides the search process and selects the final answer from candidate trajectories.
  • Figure 3: Flowchart of the fully LLM-driven Simulated Execution Reward Mechanism. $f(\mathbf{t, g})$ represents the prediction of the trajectory, and $h(\mathbf{x})$ represents the simulated execution result.
  • Figure 4: MCTS-Judge (darker colors) significantly enhances LLMs' inherent code evaluation capabilities (lighter colors) across three benchmarks.
  • Figure 5: Increasing test cases ($\alpha$), executions per case ($\delta$), tree depth, and rollouts improves MCTS-Judge's accuracy on APPS, revealing a test-time scaling law.
  • ...and 2 more figures
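
The simulated execution reward described in Figures 3 and 5 can be sketched roughly as follows. Only the ingredients come from the captions: a trajectory prediction $f(\mathbf{t, g})$, a simulated execution result $h(\mathbf{x})$, a number of test cases ($\alpha$), and a number of executions per case ($\delta$). Everything else is an assumption made for illustration: the `simulate` callable stands in for an LLM call, the per-case majority vote and the "all cases must pass" rule are guesses at the aggregation, and the reward values of +1 and -1 are arbitrary.

```python
from typing import Callable, Sequence

def simulated_verdict(
    test_cases: Sequence[str],
    simulate: Callable[[str], bool],
    delta: int = 3,
) -> bool:
    """h(x): does the code pass simulated execution on every test case?

    ASSUMPTION: each case is simulated `delta` times and passes only on a
    strict majority of passing runs; the code is judged correct only if all
    cases pass. The paper's aggregation may differ.
    """
    for case in test_cases:  # alpha = len(test_cases)
        votes = [simulate(case) for _ in range(delta)]
        if sum(votes) * 2 <= len(votes):  # no strict majority of passes
            return False
    return True

def execution_reward(
    trajectory_prediction: bool,
    test_cases: Sequence[str],
    simulate: Callable[[str], bool],
    delta: int = 3,
) -> float:
    """Reward a trajectory whose prediction f agrees with h(x).

    ASSUMPTION: +1.0 on agreement, -1.0 otherwise (illustrative values).
    """
    agrees = trajectory_prediction == simulated_verdict(test_cases, simulate, delta)
    return 1.0 if agrees else -1.0

# Usage with a deterministic stub in place of the LLM-driven simulation.
def stub_simulate(case: str) -> bool:
    return case != "empty input"  # pretend the code fails on this one case

cases = ["typical input", "empty input"]
print(execution_reward(False, cases, stub_simulate))  # trajectory says "incorrect"
print(execution_reward(True, cases, stub_simulate))   # trajectory says "correct"
```

Under these assumptions, raising `delta` or adding test cases costs more simulated executions per reward, which is the kind of test-time resource that Figure 5 varies.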