Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search

Vahid Majdinasab, Amin Nikanjam, Foutse Khomh

TL;DR

Prism is a dynamic benchmarking framework that treats the evaluation of LLM code generation as a Markov Decision Process and uses Monte Carlo Tree Search to iteratively uncover challenging programming scenarios. A three-phase evaluation (capability mapping, challenge discovery, and comprehensive root-cause analysis) and a coordinated multi-agent system enable end-to-end task simulation, error diagnosis, and solution repair. Across five state-of-the-art LLMs, Prism reveals how model architecture and scale influence performance across concept-difficulty spaces and identifies systematic failure modes not captured by static benchmarks. This dynamic, diagnostics-rich approach provides actionable insights for advancing robust, reliable code generation in LLMs and can be adapted to domains beyond programming tasks.

Abstract

The rapid advancement of Large Language Models (LLMs) has outpaced traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree-based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi-agent evaluation pipeline that enables simultaneous assessment of diverse capabilities. To ensure robust evaluation, Prism integrates structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Through extensive experiments on five state-of-the-art LLMs, we analyze how model architecture and scale influence code generation performance across varying task difficulties. Our results demonstrate Prism's effectiveness as a dynamic benchmark that evolves with model advancements while offering deeper insights into their limitations.
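
To make the first two components more concrete, the following is a minimal sketch of Monte Carlo Tree Search over a tree of (concept, difficulty) states, where the search is steered toward scenarios the model fails. It is not the authors' implementation. The names `Node`, `ucb1`, `expand`, `search`, `CONCEPTS`, and `toy_model`, the five-level difficulty scale, and the reward (1 minus success) are all assumptions chosen for illustration; the abstract does not specify Prism's actual state, action, or reward definitions.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical concept list and difficulty scale (not taken from the paper).
CONCEPTS = ["loops", "recursion", "dynamic_programming"]
MAX_DIFFICULTY = 5  # assumed: 1 = very easy ... 5 = very hard


@dataclass
class Node:
    concept: str
    difficulty: int
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # cumulative reward backed up through this node


def ucb1(node, c=1.4):
    """Standard UCB1 score; unvisited nodes are tried first."""
    if node.visits == 0:
        return math.inf
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def expand(node):
    """Root branches into one node per concept; other nodes get one harder child."""
    if node.parent is None:
        node.children = [Node(c, 1, parent=node) for c in CONCEPTS]
    elif node.difficulty < MAX_DIFFICULTY:
        node.children = [Node(node.concept, node.difficulty + 1, parent=node)]


def search(evaluate, iterations=200, seed=0):
    """Run MCTS; evaluate(concept, difficulty, rng) returns 1.0 on success, 0.0 on failure."""
    rng = random.Random(seed)
    root = Node("root", 0)
    for _ in range(iterations):
        # Selection: follow the highest UCB1 score down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # Expansion: grow the tree below the root or below an already-visited leaf.
        if node is root or node.visits > 0:
            expand(node)
            if node.children:
                node = rng.choice(node.children)
        # Simulation: assumed reward is high when the model fails the task.
        reward = 1.0 - evaluate(node.concept, node.difficulty, rng)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


def toy_model(concept, difficulty, rng):
    """Stand-in for an LLM under test: success probability decays with difficulty."""
    skill = {"loops": 0.95, "recursion": 0.8, "dynamic_programming": 0.5}
    return 1.0 if rng.random() < skill[concept] ** difficulty else 0.0


if __name__ == "__main__":
    root = search(toy_model)
    for child in root.children:
        print(child.concept, child.visits)
```

Because the reward is highest where the stand-in model fails, UCB1 spends most of its visits on the weakest concept while still occasionally sampling the others. A real harness would replace `toy_model` with task generation, model invocation, and test execution.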

Paper Structure

This paper contains 39 sections, 29 equations, 21 figures, 4 tables, 1 algorithm.

Figures (21)

  • Figure 1: Prism is an end-to-end, tree-based, multi-phase evaluation framework for dynamic benchmarking of LLMs across different code generation tasks. It allows for a comprehensive evaluation of the model's capabilities by prioritizing and exploring the search space based on the model's performance using MCTS.
  • Figure 2: Tree growth analysis across different models. The left panel (Phase 1) shows the node count at each tree depth (lines) and the cumulative number of nodes up to that depth (shaded areas). The right panel (Phase 2) shows each model's proportion of nodes at each depth, indicating relative search focus across tree depths.
  • Figure 3: Radar plots showing the performance of 4o, 4o-M, and L-405b across concepts at each difficulty level. Green: very easy/easy; Yellow: medium; Red: hard/very hard. The radial axis represents the success rate (between 0 and 1), while the circumferential axis shows different programming concepts. Higher values indicate better performance.
  • Figure 4: Success ratios for the most challenging programming patterns, grouped by the four most challenging concepts for each model, for 4o, 4o-M, and L-405b. Stacked bars represent performance across difficulty levels; higher stacks indicate better overall performance. Green: very easy/easy; Yellow: medium; Red: hard/very hard. The results highlight model-specific weaknesses in handling complex programming concepts and patterns.
  • Figure 5: Agent interactions and usage in each phase.
  • ...and 16 more figures
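
The captions of Figures 3 and 4 group the five difficulty levels into three bands and report a success rate between 0 and 1 per concept. A minimal sketch of that aggregation is shown below. The record format `(concept, difficulty, passed)` and the names `BANDS` and `success_rates` are hypothetical; only the band grouping itself comes from the captions.

```python
from collections import defaultdict

# Band grouping as described in the captions of Figures 3 and 4.
BANDS = {
    "very easy": "green",
    "easy": "green",
    "medium": "yellow",
    "hard": "red",
    "very hard": "red",
}


def success_rates(results):
    """Map (concept, band) to the fraction of passed attempts.

    `results` is an iterable of (concept, difficulty, passed) tuples,
    a record format assumed here for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # (concept, band) -> [passes, attempts]
    for concept, difficulty, passed in results:
        key = (concept, BANDS[difficulty])
        counts[key][0] += int(passed)
        counts[key][1] += 1
    return {key: passes / attempts for key, (passes, attempts) in counts.items()}


if __name__ == "__main__":
    demo = [
        ("recursion", "very easy", True),
        ("recursion", "easy", False),
        ("recursion", "very hard", False),
    ]
    print(success_rates(demo))
```

Each resulting value corresponds to one point on a radar plot: the concept selects the spoke, the band selects the colored trace, and the rate sets the radius.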