MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning

Guoxin Chen, Zile Qiao, Wenqing Wang, Donglei Yu, Xuanzhong Chen, Hao Sun, Minpeng Liao, Kai Fan, Yong Jiang, Penguin Xie, Wayne Xin Zhao, Ruihua Song, Fei Huang

TL;DR

MARS tackles overanalysis and static knowledge limitations in large language models by introducing a dual-system framework that couples System 1's fast, intuitive processing with System 2's deliberate reasoning. It integrates external tools (e.g., Google Search, Google Scholar, Python Interpreter) and a data-curation pipeline, all trained under a multi-agent reinforcement learning objective that extends Group Relative Policy Optimization (GRPO). Key innovations include a bin-packing strategy for handling large retrieved content, advantage pre-computation with balanced sampling to synchronize System 1 and System 2 learning, and a joint GRPO-based training objective. Empirically, MARS improves by 3.86% on Humanity's Last Exam and by 8.9% on average across 7 knowledge-intensive tasks, narrowing the gap to proprietary models while using far fewer parameters.

Abstract

Large Reasoning Models (LRMs) often exhibit a tendency for overanalysis in simple tasks, where the models excessively utilize System 2-type, deliberate reasoning, leading to inefficient token generation. Furthermore, these models face challenges in adapting their reasoning capabilities to rapidly changing environments due to the static nature of their pretraining data. To address these issues, advancing Large Language Models (LLMs) for complex reasoning tasks requires innovative approaches that bridge intuitive and deliberate cognitive processes, akin to human cognition's dual-system dynamic. This paper introduces a Multi-Agent System for Deep ReSearch (MARS) enabling seamless integration of System 1's fast, intuitive thinking with System 2's deliberate reasoning within LLMs. MARS strategically integrates multiple external tools, such as Google Search, Google Scholar, and Python Interpreter, to access up-to-date information and execute complex computations, while creating a specialized division of labor where System 1 efficiently processes and summarizes high-volume external information, providing distilled insights that expand System 2's reasoning context without overwhelming its capacity. Furthermore, we propose a multi-agent reinforcement learning framework extending Group Relative Policy Optimization to simultaneously optimize both systems with multi-turn tool interactions, bin-packing optimization, and sample balancing strategies that enhance collaborative efficiency. Extensive experiments demonstrate MARS achieves substantial improvements of 3.86% on the challenging Humanity's Last Exam (HLE) benchmark and an average gain of 8.9% across 7 knowledge-intensive tasks, validating the effectiveness of our dual-system paradigm for complex reasoning in dynamic information environments.

Paper Structure

This paper contains 34 sections, 10 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of Dual-System Collaborative Framework in our MARS.
  • Figure 2: Demonstration of GRPO with multi-agent reinforcement learning in our MARS.
  • Figure 3: Comprehensive analysis of our RL training process. The $x$-axis represents training steps for all subfigures. (a-c) Core performance metrics: HLE score (on 320 randomly selected questions), training reward, and tool usage frequency per question. (d-f) Evolution of tool selection preferences across the three available tools. While Google Search emerges as the predominantly chosen tool due to our training data distribution, we maintain all tools to preserve System 2's autonomous tool selection capability for diverse scenarios. (g-i) Response length distributions showing minimum (predominantly System 1), mean, and maximum (predominantly System 2) response lengths. Training was terminated after step 150 because responses consistently exceeded our preset length constraints.
  • Figure 4: Our data curation pipeline.
  • Figure 5: Distribution of Correct Number in Best-of-N ($N=16$). Questions answered correctly 1-12 times were retained for training, while those answered 0 times (potentially ambiguous or lacking definitive solutions) or >12 times (too trivial) were excluded.
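
The retention rule in the Figure 5 caption can be illustrated with a minimal sketch. It assumes each question already has a count of correct answers out of $N=16$ sampled responses. The function name, parameter names, and question IDs below are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List


def filter_by_correct_count(
    correct_counts: Dict[str, int],
    n_samples: int = 16,    # N in Best-of-N, as stated in the Figure 5 caption
    min_correct: int = 1,   # 0 correct: potentially ambiguous or unsolvable
    max_correct: int = 12,  # more than 12 correct: too trivial
) -> List[str]:
    """Return IDs of questions whose correct count lies in [min_correct, max_correct].

    Hypothetical helper: the thresholds come from the caption; everything
    else is an illustrative assumption.
    """
    kept = []
    for question_id, count in correct_counts.items():
        if not 0 <= count <= n_samples:
            raise ValueError(f"{question_id}: count {count} outside 0..{n_samples}")
        if min_correct <= count <= max_correct:
            kept.append(question_id)
    return kept


# Made-up counts for illustration only.
example_counts = {
    "q_never": 0,     # excluded: never answered correctly
    "q_hard": 1,      # kept: lower boundary
    "q_mid": 7,       # kept
    "q_edge": 12,     # kept: upper boundary
    "q_easy": 13,     # excluded: too trivial
    "q_trivial": 16,  # excluded: too trivial
}
retained = filter_by_correct_count(example_counts)
```

Both boundaries are inclusive here, matching the caption's "1-12 times" wording.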