MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu

TL;DR

MarsRL tackles the bottleneck of reasoning depth in LLMs by deploying an agentic reinforcement learning framework that jointly optimizes Solver, Verifier, and Corrector within a pipeline-parallel training setup. It introduces per-agent rewards to decouple credit assignment and leverages grouped, segmented rollouts to handle ultra-long trajectories efficiently. Empirical results on Qwen3-30B-A3B-Thinking-2507 show substantial gains on AIME2025 and BeyondAIME, with performance surpassing larger models and demonstrating cross-Solver generalization for the Verifier and Corrector. The work advances practical multi-agent reasoning by addressing reward noise and training efficiency, enabling broader applicability across complex reasoning tasks.

Abstract

Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference pass. Multi-agent reasoning systems offer a promising alternative by employing multiple agents, including a Solver, a Verifier, and a Corrector, to iteratively refine solutions. Although such systems are effective with closed-source models like Gemini 2.5 Pro, they generalize poorly to open-source models, whose critique and correction capabilities are insufficient. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to improve efficiency when handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME accuracy from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
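
To make the Solver, Verifier, and Corrector loop and the idea of agent-specific rewards concrete, here is a minimal sketch in Python. It is not the authors' implementation. The names `Step` and `run_episode`, the `max_rounds` limit, the exact-match correctness check, and the specific reward rule (Solver and Corrector are scored on their own output against a reference answer, the Verifier on whether its verdict matches the true correctness of the solution it judged) are all assumptions made for illustration, and the agents are toy functions standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    agent: str     # "solver", "verifier", or "corrector"
    output: str    # solution text, or the verdict "accept"/"reject"
    reward: float  # reward credited to this agent's step only


def run_episode(
    problem: str,
    reference: str,
    solve: Callable[[str], str],
    verify: Callable[[str, str], bool],
    correct: Callable[[str, str], str],
    max_rounds: int = 3,  # hypothetical cap on verify/correct rounds
) -> List[Step]:
    """Run one solve -> verify -> correct episode with per-agent rewards."""
    steps: List[Step] = []
    solution = solve(problem)
    # Assumed rule: the Solver is scored on its own answer.
    steps.append(Step("solver", solution, float(solution == reference)))

    for _ in range(max_rounds):
        accepted = verify(problem, solution)
        is_correct = solution == reference
        # Assumed rule: the Verifier is scored on whether its verdict
        # agrees with the true correctness of the solution it judged.
        verdict = "accept" if accepted else "reject"
        steps.append(Step("verifier", verdict, float(accepted == is_correct)))
        if accepted:
            break
        solution = correct(problem, solution)
        # Assumed rule: the Corrector is scored on its revised answer.
        steps.append(Step("corrector", solution, float(solution == reference)))
    return steps


# Toy agents (stand-ins for model calls): the Solver is wrong at first,
# the Verifier catches the error, and the Corrector repairs it.
steps = run_episode(
    problem="6 * 7 = ?",
    reference="42",
    solve=lambda p: "41",
    verify=lambda p, s: s == "42",
    correct=lambda p, s: "42",
)
for step in steps:
    print(step.agent, step.output, step.reward)
```

In this toy run the Solver's wrong answer earns no reward, while the Verifier is still credited for rejecting it. Scoring every step by the final outcome alone could not make that distinction, which is the kind of reward noise the agent-specific design aims to reduce.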

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the Verifier-Corrector Reasoning System.
  • Figure 2: Modeling the V-C Reasoning System in Agentic RL.
  • Figure 3: Illustration of agentic pipeline parallelism and grouped rollouts.
  • Figure 4: Evaluation results on the AIME-2025 benchmark for different sampling strategies.
  • Figure 5: The training dynamics of the Verifier's performance for error detection.
  • ...and 2 more figures