MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation

Shijie Wang; Pengfei Li; Yikun Fu; Kaifeng Liu; Fangyuan Li; Yang Liu; Xiaowei Sun; Zonglin Li; Siyao Zhao; Jian Zhao; Kai Tian; Dong Li; Junqi Gao; Yutong Zhang; Yiqun Chen; Yuqiang Li; Zoe Li; Weinan Zhang; Peng Ye; Shuyue Hu; Lei Bai; Bowen Zhou; Kaiyan Zhang; Biqing Qi


TL;DR

MARTI-MARS$^2$ presents a unified multi-agent reinforcement learning framework that treats collaborative self-search as a dynamic environment to scale code-generation reasoning. By transitioning from homogeneous to heterogeneous agent configurations and introducing an efficient test-time strategy (MARS$^2$-T+), the approach achieves higher reinforcement learning ceilings and robust test-time scaling, driven by policy diversity and structured error feedback. The framework combines a group-aware optimization objective with asynchronous training, refined AB-MCTS-inspired search, and a learned reward model to stabilize long-horizon reasoning. Across 8B, 14B, and 32B model scales on challenging benchmarks, the method demonstrates significant improvements in pass@1 and reveals a multi-agent scaling law that diversity among agents enhances both performance and exploration capacity, offering practical implications for scalable intelligent coding systems.

Abstract

While the complex reasoning capability of Large Language Models (LLMs) has attracted significant attention, single-agent systems often encounter inherent performance ceilings in complex tasks such as code generation. Multi-agent collaboration offers a promising avenue to transcend these boundaries. However, existing frameworks typically rely on prompt-based test-time interactions or multi-role configurations trained with homogeneous parameters, limiting error correction capabilities and strategic diversity. In this paper, we propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS$^2$), which integrates policy learning with multi-agent tree search by formulating the multi-agent collaborative exploration process as a dynamic and learnable environment. By allowing agents to iteratively explore and refine within the environment, the framework facilitates evolution from parameter-sharing homogeneous multi-role training to heterogeneous multi-agent training, breaking through single-agent capability limits. We also introduce an efficient inference strategy, MARTI-MARS$^2$-T+, to fully exploit the scaling potential of multi-agent collaboration at test time. We conduct extensive experiments across varied model scales (8B, 14B, and 32B) on challenging code generation benchmarks. Utilizing two collaborating 32B models, MARTI-MARS$^2$ achieves 77.7%, outperforming strong baselines like GPT-5.1. Furthermore, MARTI-MARS$^2$ reveals a novel scaling law: shifting from single-agent to homogeneous multi-role and ultimately to heterogeneous multi-agent paradigms progressively yields higher RL performance ceilings, robust test-time scaling (TTS) capabilities, and greater policy diversity, suggesting that policy diversity is critical for scaling intelligence via multi-agent reinforcement learning.
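As a rough illustration of the policy-learning side, the sketch below shows the group-relative advantage normalization used in standard GRPO, the algorithm the figure captions name as the single-agent baseline and as the basis of the tree-based variant. It is not the paper's group-aware objective, whose exact form is not given on this page. The function name, the binary pass/fail rewards, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization within one sampled group.

    Each candidate's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    Hypothetical helper: not taken from the MARTI-MARS^2 codebase.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four candidate programs for one problem,
# with reward 1.0 if the unit tests pass and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Passing candidates receive positive advantages and failing ones negative advantages, and a group whose rewards are all equal yields zero advantages, so it contributes no gradient signal.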


Paper Structure (76 sections, 19 equations, 13 figures, 9 tables)


Introduction
Method
Overview
The training paradigm of MARS$^2$
Multi-Agent Instantiation
Data Collection and Advantage Calculation
Data dispatching and Dynamic Training
Distributing Data.
Asynchronously triggered Dynamic Training.
The optimization objective of MARS$^2$
MARS$^2$.
MARS$^2$+.
Test Time Scaling of MARS$^2$
Multi-Agent Refinement-based tree search
Error-Feedback Integration.
...and 61 more sections

Figures (13)

Figure 1: Multi-agent scaling laws and performance of MARTI-MARS$^2$. The left panel shows, for Nemotron-32B, the scaling advantages of the homogeneous multi-role method (Homo-MARS$^2$) and the heterogeneous multi-agent method (Heter-MARS$^2$, pairing Nemotron-32B with Qwen3-32B) over the single-agent baseline (GRPO). The right panel compares the performance of MARTI-MARS$^2$ against leading open-source and closed-source LLMs.
Figure 2: The framework of MARTI-MARS$^2$. In the RL stage, the multi-role and multi-agent tree search is modeled as a learnable dynamic environment, enabling agents to improve their reasoning capabilities via a tree-based GRPO algorithm. In the TTS stage, an enhanced method, MARS$^2$-T+, is introduced, incorporating error-message feedback, dynamic depth-guided exploration, and a pre-trained reward model to achieve efficient inference.
Figure 3: Experimental results of Homo-MARS$^2$ and baseline methods on LCB benchmarks. The inference budget is $N=60$.
Figure 4: Performance of Homo-MARS$^2$ and GRPO on Qwen3-8B across training steps.
Figure 5: Pass@1 results of Heter-MARS$^2$ and baseline methods on LCB benchmarks.
...and 8 more figures
