Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning

Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li

TL;DR

MSARL tackles cognitive overhead in tool-enabled reasoning by decoupling high-level reasoning from tool interpretation through a dedicated Reasoning Agent and specialized Tool Agents. It trains these agents jointly with collaboration-oriented rewards, enabling efficient information flow and scalable interaction patterns. On mathematical problem solving requiring code execution, MSARL achieves higher reasoning stability and final-answer accuracy than single-agent baselines and generalizes to diverse tool-use tasks. The work offers empirical evidence and a modular blueprint for building scalable, specialized-agent AI systems.

Abstract

Recent advances in multi-agent systems highlight the potential of specialized small agents that collaborate via division of labor. Existing tool-integrated reasoning systems, however, often follow a single-agent paradigm in which one large model interleaves long-horizon reasoning with precise tool operations, leading to cognitive-load interference and unstable coordination. We present MSARL, a Multi-Small-Agent Reinforcement Learning framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via a combination of imitation learning and reinforcement learning with role-specific rewards. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent baselines. Moreover, the architecture generalizes to diverse tool-use tasks, demonstrating that cognitive-role decoupling with small agents is a scalable blueprint for multi-agent AI design.
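
The division of labor described above can be illustrated with a minimal sketch. The abstract does not specify any interfaces, so every name here (`ToolRequest`, `ToolAgent`, `ReasoningAgent`, `evaluate_arithmetic`) is hypothetical, and the trained policies are replaced by hand-written stubs. No imitation or reinforcement learning is shown; the sketch only illustrates how a reasoner might delegate a tool call and receive a short result back instead of handling raw tool output itself.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical message passed from the reasoner to a tool specialist.
@dataclass
class ToolRequest:
    tool: str      # name of the Tool Agent that should handle the request
    payload: str   # what the Reasoning Agent wants executed


class ToolAgent:
    """Stand-in for a small agent specialized in one external tool."""

    def __init__(self, name: str, run_tool: Callable[[str], str]):
        self.name = name
        self.run_tool = run_tool

    def handle(self, request: ToolRequest) -> str:
        raw = self.run_tool(request.payload)
        # Return a short labeled result so the reasoner never parses raw output.
        return f"{self.name} result: {raw}"


class ReasoningAgent:
    """Stand-in for the agent that decomposes problems and plans tool calls."""

    def __init__(self, tool_agents: Dict[str, ToolAgent]):
        self.tool_agents = tool_agents

    def plan(self, problem: str) -> List[ToolRequest]:
        # Assumption: a trained policy would produce this plan. The stub
        # simply forwards the whole problem to the "python" tool specialist.
        return [ToolRequest(tool="python", payload=problem)]

    def solve(self, problem: str) -> List[str]:
        observations = []
        for request in self.plan(problem):
            agent = self.tool_agents[request.tool]
            observations.append(agent.handle(request))
        return observations


# Toy "code execution" tool: evaluates +, -, *, / on numeric literals only.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_arithmetic(expression: str) -> str:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return str(walk(ast.parse(expression, mode="eval").body))


if __name__ == "__main__":
    reasoner = ReasoningAgent({"python": ToolAgent("python", evaluate_arithmetic)})
    print(reasoner.solve("2 * (3 + 4)"))  # ['python result: 14']
```

In this sketch the reasoner only ever sees the short string returned by `handle`, which is one way to read the claim that decoupling keeps tool details out of the reasoning context. How MSARL actually formats these exchanges is not stated in the abstract.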

Paper Structure

This paper contains 33 sections, 23 equations, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Model performance on Math-500 under two prompting regimes.
  • Figure 2: Overview of our method.
  • Figure 3: Training prompt templates for MSARL.
  • Figure 4: The model's performance (Average Pass@1) at different training checkpoints. Performance saturates after 2k steps.
  • Figure 5: Average reward score during training. The consistent upward trend demonstrates successful and stable learning, with the policy converging in the later stages.