ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav

TL;DR

The paper tackles the challenge of designing generalizable multi-agent systems by showing that simple CoT baselines remain highly competitive and that focusing on the core reasoning unit yields greater gains. It introduces the Agentic Reasoning Module (ARM), which replaces each CoT step with a modular, agentic block discovered via a reflection-guided evolutionary search, paired with a separately discovered meta-policy ($\pi^*$). The step-generator $m^*$ is found through a scaffolded objective within a reference CoT trace, while the meta-policy is optimized with a simple CoT surrogate and transferred zero-shot to $m^*$. Across multiple benchmarks and foundation models, ARM demonstrates superior performance and strong generalization without task-specific re-optimization, suggesting that reusable, well-structured reasoning units can outperform brittle, hand-tuned MAS designs.

Abstract

Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.
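The search described in the abstract (start from a simple CoT module, then grow a tree of candidates by mutating a parent based on reflection over its execution trace) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Node`, `search`, `toy_evaluate`, and `toy_reflect_and_mutate` are hypothetical, candidate "programs" are tuples of component names rather than real code, the evaluator is a stub instead of an LLM run on a validation set, and greedy parent selection stands in for whatever selection rule the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Program = Tuple[str, ...]  # toy stand-in for a module's source code


@dataclass
class Node:
    program: Program
    score: float
    trace: List[str]  # failure notes collected while executing the candidate
    children: List["Node"] = field(default_factory=list)


def search(
    root_program: Program,
    evaluate: Callable[[Program], Tuple[float, List[str]]],
    reflect_and_mutate: Callable[[Program, List[str]], Program],
    iterations: int,
) -> Tuple[Node, Node]:
    """Grow a tree of candidates; return (best node, root node)."""
    score, trace = evaluate(root_program)
    root = Node(root_program, score, trace)
    nodes = [root]
    seen = {root_program}
    for _ in range(iterations):
        # Assumption: always expand the current best-scoring node.
        parent = max(nodes, key=lambda n: n.score)
        child_program = reflect_and_mutate(parent.program, parent.trace)
        if child_program in seen:
            continue  # the mutation proposed nothing new
        seen.add(child_program)
        score, trace = evaluate(child_program)
        child = Node(child_program, score, trace)
        parent.children.append(child)
        nodes.append(child)
    return max(nodes, key=lambda n: n.score), root


# Toy stand-ins (hypothetical): a candidate scores higher the more of
# these components it contains, and the trace lists what is missing.
USEFUL = ("draft", "critique", "revise")


def toy_evaluate(program: Program) -> Tuple[float, List[str]]:
    score = sum(c in program for c in USEFUL) / len(USEFUL)
    trace = [f"missing:{c}" for c in USEFUL if c not in program]
    return score, trace


def toy_reflect_and_mutate(program: Program, trace: List[str]) -> Program:
    # "Reflection" here just reads the first failure note and patches it.
    if not trace:
        return program
    return program + (trace[0].split(":", 1)[1],)


best, root = search(("draft",), toy_evaluate, toy_reflect_and_mutate, iterations=3)
print(best.program, best.score)
```

In a real setting the evaluator would execute the candidate module on held-out problems and the mutation step would prompt an LLM with the trace; the tree bookkeeping is the only part this sketch is meant to convey.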

Paper Structure

This paper contains 39 sections, 4 theorems, 17 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Let $w_t$ be the probability that the baseline remains in $\mathcal{S}_{ok}$ up to step $t$. Then, for a universal constant $C>0$, [bound not recovered from the source]. Moreover, under the conditional-stability assumption, $A_m(s)$ is dominated by error-rate reduction: [expression not recovered from the source].

Figures (3)

  • Figure 1: An illustration of the proposed ARM module (left) and the meta-policy (right), using "Self refine" as an example MAS. The ARM module takes a question and the previous reasoning steps and executes a MAS to produce the next step. The meta-policy uses ARM as a sub-module and orchestrates the overarching global strategy. This is for illustration only; the actual step generator and meta-policy discovered by Algorithm 1 are more complex (see Appendix).
  • Figure 2: Validation of the meta-policy transfer for top discovered policies. The table compares performance using the simple surrogate $m_{CoT}$ (CoT Baseline) versus the powerful ARM module $m^*$ (Meta Policy). The intermediate CoT$\to$Meta column isolates the performance gain from the superior $m^*$ module alone by evaluating it on states generated by the baseline.
  • Figure 3: Comparison of LLM-judged per-step success rates between the baseline Chain-of-Thought (CoT) and multiple ARM (CriticChainOfThought) variants. CoT appears first, followed by ARM variants ordered by final performance.
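The division of labor in the Figure 1 caption (a step generator that maps a question plus previous steps to the next step, and a meta-policy that calls it as a sub-module) can be sketched as follows. This is a toy under stated assumptions, not the discovered modules: `rollout`, `self_refine_policy`, `toy_step_generator`, and `toy_critic` are hypothetical names, the `ANSWER:` stop marker and the `Feedback:` prompt convention are invented for the example, and the stubs do arithmetic instead of calling an LLM.

```python
from typing import Callable, List, Optional

# Hypothetical interface: (question, previous steps) -> next step.
StepGenerator = Callable[[str, List[str]], str]
Critic = Callable[[str, List[str]], Optional[str]]


def rollout(question: str, m: StepGenerator, max_steps: int = 6) -> List[str]:
    """Apply the step generator recursively until it emits an answer."""
    steps: List[str] = []
    for _ in range(max_steps):
        step = m(question, steps)
        steps.append(step)
        if step.startswith("ANSWER:"):
            break
    return steps


def self_refine_policy(
    question: str, m: StepGenerator, critic: Critic, max_rounds: int = 2
) -> List[str]:
    """Self-refine style meta-policy: solve, critique, retry with feedback."""
    steps = rollout(question, m)
    for _ in range(max_rounds):
        feedback = critic(question, steps)
        if feedback is None:
            break  # the critic accepted the current trace
        steps = rollout(f"{question}\nFeedback: {feedback}", m)
    return steps


def _operands(question: str) -> List[int]:
    # Only the first line holds the problem; later lines hold feedback.
    return [int(t) for t in question.splitlines()[0].split() if t.isdigit()]


def toy_step_generator(question: str, steps: List[str]) -> str:
    numbers = _operands(question)
    if "Feedback:" not in question:
        numbers = numbers[:-1]  # simulated mistake: drops the last operand
    if len(steps) < len(numbers):
        return f"running total = {sum(numbers[: len(steps) + 1])}"
    return f"ANSWER: {sum(numbers)}"


def toy_critic(question: str, steps: List[str]) -> Optional[str]:
    expected = f"ANSWER: {sum(_operands(question))}"
    return None if steps[-1] == expected else "an operand was skipped"


final_steps = self_refine_policy("sum 2 3 5", toy_step_generator, toy_critic)
print(final_steps[-1])
```

Because the meta-policy only depends on the `StepGenerator` signature, the same policy can be run with a different step generator without changes, which is the property the zero-shot transfer from a CoT surrogate to $m^*$ relies on.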

Theorems & Definitions (5)

  • Proposition 1: Scaffolded objective optimizes per-step error rate
  • Proof: Proof of Proposition 1
  • Proposition 2: Module improvement on baseline states
  • Proposition 3: Beneficial distribution shift
  • Theorem 1: Monotonic Improvement of Meta-Policy Transfer