MASPRM: Multi-Agent System Process Reward Model

Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong

TL;DR

MASPRM introduces a per-action, per-agent Process Reward Model that provides dense, progress-aware value signals to guide inference-time search in Multi-Agent Systems. Trained from MAS-MCTS rollouts without manual step annotations, MASPRM improves compute efficiency and solution quality by steering beam search and MCTS toward promising inter-agent states, and by optionally combining with an Outcome Reward Model at termination. Empirical results show substantial exact-match gains on GSM8K and MATH, including a notable zero-shot transfer from GSM8K to MATH, and demonstrate that process-level guidance outperforms policy-only baselines while being complementary to ORM. The approach offers a practical, plug-in mechanism to stabilize and accelerate multi-agent reasoning across domains under fixed compute budgets.

Abstract

Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts by propagating returns to local targets, without requiring step-level human annotations. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning early. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) applied to the final answer improves exact match (EM) over a single straight-through MAS pass by $+30.7$ and $+22.9$ points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding $8.4$ EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM

Paper Structure

This paper contains 56 sections, 19 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Token-accuracy trade-off on GSM8K. Each point reports exact match (y-axis) versus average test-time tokens (x-axis; $\times 10^3$). MASPRM consistently shifts the frontier upward: it improves Maj@5 and step-wise beam search (SBS) under matched budgets, and when paired with MCTS and an ORM, it reaches $74.6$% EM at $\sim$19k tokens, while Greedy sits near $43.9$% EM at $\sim$1.6k tokens.
  • Figure 2: Search-generated supervision. Left: Extracted rollouts yield edge-level estimates $\hat{Q}(s,a)$; for each child $s'=\mathrm{next}(s,a)$ we set the regression target $y=\hat{Q}(s,a)$ and train the MASPRM $V_\phi(s')$ accordingly. Right: Uses of MASPRM in multi-agent systems, including inference-time guidance (SBS and MCTS via $V_\phi$ and $\hat{Q}$).
  • Figure 3: Example MAS (four agents) and schedule. Dashed directed edges define admissible routing and the order of agent outputs $o_1$-$o_4$; one agent acts per depth according to $\sigma$.
  • Figure 4: Stylized transcript (example). A 4-agent pipeline produces outputs $o_1$-$o_4$. Green is correct ($a{=}c{=}7\Rightarrow 5a{-}3c=14$); red is incorrect ($35{-}21\neq 15$).
  • Figure 5: MCTS over the unrolled tree (example). Numbers on nodes are illustrative mean values $\overline V$; green/red leaves denote $R\in\{+1,-1\}$.
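
The supervision scheme in the Figure 2 caption (edge estimates $\hat{Q}(s,a)$ used as regression targets for the child state $s'=\mathrm{next}(s,a)$) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the MASPRM code base: transcripts are encoded as tuples of agent outputs, `next_state` simply appends an output, and $\hat{Q}$ is taken as the plain mean of terminal rewards $R\in\{+1,-1\}$ over the rollouts that traverse an edge.

```python
from collections import defaultdict

def next_state(state, action):
    # Hypothetical transition: the child transcript is the parent plus one agent output.
    return state + (action,)

def edge_estimates(rollouts):
    """Mean terminal reward per edge, used here as a stand-in for Q_hat(s, a).

    rollouts: list of (path, reward), where path is a list of (state, action)
    edges and reward is the terminal R in {+1, -1}.
    """
    total = defaultdict(float)
    visits = defaultdict(int)
    for path, reward in rollouts:
        for edge in path:
            total[edge] += reward   # back up the terminal reward along the path
            visits[edge] += 1
    return {edge: total[edge] / visits[edge] for edge in total}

def training_pairs(rollouts):
    # For each edge (s, a), the child s' = next(s, a) gets target y = Q_hat(s, a).
    q_hat = edge_estimates(rollouts)
    return [(next_state(s, a), y) for (s, a), y in q_hat.items()]

# Toy example: three rollouts sharing the first agent output "o1".
rollouts = [
    ([((), "o1"), (("o1",), "o2a")], +1),
    ([((), "o1"), (("o1",), "o2b")], -1),
    ([((), "o1"), (("o1",), "o2a")], +1),
]
pairs = dict(training_pairs(rollouts))
for child, target in pairs.items():
    print(child, round(target, 3))
```

In this toy run the shared prefix `("o1",)` receives an intermediate target because its descendants disagree, while the two depth-2 children receive the extreme targets of their own leaves. A value model $V_\phi$ would then be fit to such `(child, target)` pairs by regression; the model and loss are outside the scope of this sketch.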