Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Wei Yang, Jiacheng Pang, Shixuan Li, Paul Bogdan, Stephen Tu, Jesse Thomason

TL;DR

This work introduces Maestro, a two-phase framework that separates divergent exploration from convergent synthesis in multi-agent LLM systems (MAS). Parallel Execution Agents generate candidate solutions, and a Central Agent selects among them. The Central Agent is trained with a new reinforcement learning objective, Conditional Listwise Policy Optimization (CLPO), which decouples decision signals from rationales to achieve clean credit assignment and stable convergence. Results on GSM8K, MATH, AMC, MMLU, and HumanEval show consistent gains over existing state-of-the-art multi-agent approaches and robustness across backbones, with average accuracy improvements of around 4–8% and peaks of up to 10%. The approach improves scalable collaboration on complex reasoning tasks and opens paths toward unified exploration–synthesis objectives and continuous learning of coordination strategies.

Abstract

Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single-model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem in which genuine reasoning cannot be distinguished from superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a listwise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10%.
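To make the structure of such an objective concrete, the following is a minimal sketch of a loss that combines a decision-focused policy-gradient term, a listwise ranking term over rationales, and KL and entropy regularizers. It is not the paper's CLPO formula, which is not reproduced in this summary. The REINFORCE surrogate, the ListNet-style cross-entropy, the function name `clpo_style_loss`, and the weights `lam_rank`, `beta_kl`, and `alpha_ent` are all assumptions chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def clpo_style_loss(choice_logits, ref_logits, chosen, advantage,
                    rationale_scores, rationale_rewards,
                    lam_rank=1.0, beta_kl=0.01, alpha_ent=0.01):
    """Illustrative combined loss (assumed form, not the paper's equation).

    choice_logits     : policy logits over the candidate set
    ref_logits        : reference-policy logits over the same candidates
    chosen            : index of the candidate the policy selected
    advantage         : scalar advantage assigned to that decision
    rationale_scores  : policy scores for a list of rationales
    rationale_rewards : reward assigned to each rationale in that list
    """
    logp = log_softmax(choice_logits)
    ref_logp = log_softmax(ref_logits)
    p = [math.exp(v) for v in logp]

    # Decision term: REINFORCE surrogate on the selected candidate only.
    policy_gradient = -advantage * logp[chosen]

    # Listwise term: cross-entropy between the reward-induced distribution
    # and the score-induced distribution (ListNet-style, an assumption).
    target = [math.exp(v) for v in log_softmax(rationale_rewards)]
    score_logp = log_softmax(rationale_scores)
    rank = -sum(t * s for t, s in zip(target, score_logp))

    # Regularizers: KL(policy || reference) and policy entropy.
    kl = sum(pi * (lp - rlp) for pi, lp, rlp in zip(p, logp, ref_logp))
    entropy = -sum(pi * lp for pi, lp in zip(p, logp))

    total = policy_gradient + lam_rank * rank + beta_kl * kl - alpha_ent * entropy
    return {"policy_gradient": policy_gradient, "rank": rank,
            "kl": kl, "entropy": entropy, "total": total}
```

Keeping the decision term and the ranking term as separate summands mirrors the idea of disentangling signals: the advantage touches only the log-probability of the selected candidate, while the rationales are supervised comparatively against one another.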

Paper Structure

This paper contains 25 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the Maestro framework. First, $N$ execution agents each generate $K$ candidate reasoning-answer pairs, forming a broad solution pool. A central agent then governs exploitation by applying discriminative selection over the candidate set. The decision policy $\pi_\theta$ is trained under Conditional Listwise Policy Optimization (CLPO), which integrates a choice-aware objective, a reasoning-rank objective, and regularization terms including KL divergence and entropy.
  • Figure 2: Comparison of collaboration paradigms. Left: task accuracy on AMC and GSM8K across different central coordination strategies. Right: performance decomposed into coverage and identification rates; central selection transforms diverse reasoning into reliable outcomes.
  • Figure 3: Ablation studies on central selection inputs and CLPO losses. Left: Reason-only, Answer-only, and Both settings when passing candidate information to the central selector. Right: contributions of loss components studied by removing choice or reasoning supervision.
  • Figure 4: Effect of collaborative scale on reasoning performance. Left: accuracy with varying agent numbers. Right: impact of sampling multiplicity. In each, the solid line corresponds to GSM8K accuracy (right axis) and the dashed line corresponds to AMC accuracy (left axis).
  • Figure 5: Sankey diagram illustrating performance on AMC and GSM8K under different central coordination strategies. Each flow decomposes accuracy into coverage and identification outcomes, showing how centralized selection more effectively converts diverse reasoning into correct solutions.
  • ...and 1 more figure
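
The candidate pool described for Figure 1 and the coverage/identification split mentioned for Figures 2 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the paper's code. The agent interface, the helper names, and `majority_select` (a simple placeholder for the learned central agent) are hypothetical. The definitions used here, coverage as the share of problems whose pool contains a correct answer and identification as the share of covered problems where the selected answer is correct, are one plausible reading of the captions.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical agent interface: (question, sample_index) -> (rationale, answer).
Agent = Callable[[str, int], Tuple[str, str]]


def build_pool(question: str, agents: List[Agent], k: int) -> List[Tuple[str, str]]:
    """Exploration: each of the N agents contributes K (rationale, answer) pairs."""
    return [agent(question, j) for agent in agents for j in range(k)]


def majority_select(pool: Sequence[Tuple[str, str]]) -> str:
    """Placeholder for the central agent: return the most frequent answer."""
    return Counter(answer for _, answer in pool).most_common(1)[0][0]


def decompose(records: Sequence[Tuple[Sequence[str], str, str]]):
    """Split accuracy into coverage and identification (assumed definitions).

    Each record is (pool_answers, selected_answer, gold_answer).
    """
    total = len(records)
    # Covered: at least one candidate in the pool matches the gold answer.
    covered = [r for r in records if r[2] in r[0]]
    # Hit: the problem is covered and the selected answer is the gold answer.
    hits = [r for r in covered if r[1] == r[2]]
    coverage = len(covered) / total
    identification = len(hits) / len(covered) if covered else 0.0
    accuracy = len(hits) / total
    return coverage, identification, accuracy
```

When the selected answer always comes from the pool, accuracy equals coverage times identification under these definitions, so a better selector raises accuracy without changing what the execution agents produce.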