PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models

Jiawei Xu, Zhenyu Yu, Ziqian Bi, Minh Duc Pham, Xiaoyi Qu, Danyang Zhang

TL;DR

PRIME addresses long-horizon algorithmic reasoning in large language models with a Policy-Reinforced Iterative Multi-agent Execution framework that coordinates three specialized agents (Executor, Verifier, Coordinator) and optimizes them via Group Relative Policy Optimization. The authors pair PRIME with PRIME-Bench, the largest algorithmic reasoning benchmark to date, featuring 86 tasks across 12 categories with execution traces that exceed one million steps. Empirically, PRIME achieves an average accuracy of 93.8% on PRIME-Bench, a 250% relative improvement over the 26.8% baseline, with especially large gains on tasks requiring sustained state tracking such as Turing machine simulation and long division. Ablation analyses identify iterative verification as the key contributor to robustness, and scale analyses show that smaller models benefit disproportionately from structured prompting, enabling strong performance with resource-efficient configurations. Together, PRIME and PRIME-Bench offer a scalable framework and an evaluation standard for reliable algorithmic reasoning in open-source LLMs, with practical implications for deployment and future research.

Abstract

Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To address this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control. The agents are optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B to 120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
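
As a quick check on the headline figure, the 250% relative gain follows directly from the two average accuracies quoted in the abstract. The worked arithmetic below uses only those two numbers:

$$\frac{93.8 - 26.8}{26.8} = \frac{67.0}{26.8} = 2.5 = 250\%$$

In other words, the gain is measured relative to the baseline accuracy, so the final accuracy is 3.5 times the baseline, not 2.5 times.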

Paper Structure (216 sections, 13 theorems, 105 equations, 31 figures, 90 tables, 12 algorithms)

Key Result

Theorem 1

The minimum number of moves required to transfer $n$ disks from source to destination peg is exactly $2^n - 1$, achieved by the recursive algorithm: move $n-1$ disks to auxiliary, move largest disk to destination, move $n-1$ disks from auxiliary to destination.
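
The recursive procedure named in the theorem can be sketched in a few lines. The following is a minimal illustration, not code from the paper; the function name `hanoi_moves` and the peg labels "A", "B", "C" are assumptions chosen for the example. It generates the move sequence and confirms that its length matches $2^n - 1$ for small $n$, which is consistent with the recurrence $T(n) = 2T(n-1) + 1$, $T(0) = 0$.

```python
def hanoi_moves(n, source="A", destination="C", auxiliary="B"):
    """Return the list of (from_peg, to_peg) moves that transfers n disks.

    Peg labels are illustrative assumptions, not taken from the paper.
    """
    if n == 0:
        return []
    return (
        # Step 1: move the top n-1 disks out of the way, onto the auxiliary peg.
        hanoi_moves(n - 1, source, auxiliary, destination)
        # Step 2: move the largest disk to the destination peg.
        + [(source, destination)]
        # Step 3: move the n-1 disks from the auxiliary peg onto the largest disk.
        + hanoi_moves(n - 1, auxiliary, destination, source)
    )


if __name__ == "__main__":
    for n in range(1, 11):
        # The move count of the recursive algorithm equals 2^n - 1.
        assert len(hanoi_moves(n)) == 2**n - 1
    print(hanoi_moves(2))
```

The sketch only demonstrates that the recursive algorithm attains $2^n - 1$ moves; the matching lower bound is what the theorem's proof supplies.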

Figures (31)

  • Figure 1: The PRIME Framework Architecture. The Executor generates reasoning steps, which are immediately validated by the Verifier. Upon constraint violation, the Coordinator manages backtracking via the State Stack. The entire policy is iteratively refined using Group Relative Policy Optimization (GRPO).
  • Figure 2: N-Queens Problem Illustration. The left panel shows a valid 8-Queens solution where no two queens threaten each other (queens cannot share the same row, column, or diagonal). The right panel demonstrates the backtracking search process: when a conflict is detected (red arrows indicating threatened positions), the algorithm backtracks to try alternative placements.
  • Figure 3: Relative improvement from baseline to optimized prompting across model scales. Smaller models exhibit substantially larger relative gains, suggesting that structured prompting compensates for limited model capacity.
  • Figure 4: Accuracy as a function of model size (log scale) under baseline and optimized prompting conditions. Optimized prompting elevates performance across all scales while compressing the performance gap between small and large models.
  • Figure 5: Accuracy as a function of board size $N$ under optimized prompting. All models exhibit graceful degradation with increasing difficulty, with larger models maintaining higher absolute performance throughout.
  • ...and 26 more figures
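
The propose, verify, and backtrack cycle described in the captions of Figures 1 and 2 can be illustrated with a classical N-Queens search. This is a hedged sketch of the control flow only: in PRIME the Executor, Verifier, and Coordinator are LLM agents trained with GRPO, whereas here they are plain Python steps, and the names `verify` and `solve_n_queens` are assumptions made for the example.

```python
def verify(placement, row, col):
    """Verifier role: True if a queen at (row, col) attacks no earlier queen.

    placement[r] is the column of the queen already placed in row r.
    """
    for r, c in enumerate(placement):
        if c == col or abs(c - col) == abs(r - row):
            return False  # same column or same diagonal
    return True


def solve_n_queens(n):
    """Return one solution as a list of columns per row, or None if none exists."""
    stack = []  # state stack: one accepted column per completed row
    col = 0     # next column to try in the current row
    while len(stack) < n:
        row = len(stack)
        # Executor role: propose columns until one passes verification.
        while col < n and not verify(stack, row, col):
            col += 1
        if col < n:
            stack.append(col)  # accept the step and move to the next row
            col = 0
        else:
            # Coordinator role: no valid column, so backtrack one row.
            if not stack:
                return None
            col = stack.pop() + 1
    return stack


if __name__ == "__main__":
    print(solve_n_queens(8))
```

Every step is checked before it is committed to the stack, so a bad placement is undone at the row where it is detected instead of propagating to later rows.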

Theorems & Definitions (43)

  • Definition 1: Sorting Task
  • Definition 2: Execution Trace
  • Definition 3: Counting Sort Invariant
  • Definition 4: Shortest Path Correctness
  • Definition 5: Red-Black Tree Properties
  • Theorem 1: Tower of Hanoi Optimality
  • Proof (of Theorem 1)
  • Definition 6: DFA Acceptance
  • Definition 7: DPLL Procedure
  • Definition 8: Shell Sort Gap Sequence
  • ...and 33 more