Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Wei Yang; Jiacheng Pang; Shixuan Li; Paul Bogdan; Stephen Tu; Jesse Thomason

TL;DR

This work introduces Maestro, a principled two-phase framework that separates divergent exploration from convergent synthesis in multi-agent LLM systems. It couples parallel Execution Agents with a Central Agent and a new reinforcement learning objective, Conditional Listwise Policy Optimization (CLPO), which decouples decision signals from rationales to achieve clean credit assignment and stable convergence. Empirical results across GSM8K, MATH, AMC, MMLU, and HumanEval show consistent, state-of-the-art gains and robustness across backbones, with average improvements around 4–8% and peaks up to 10% in accuracy. The approach advances scalable collaboration in MAS, offering strong improvements for complex reasoning tasks and opening paths toward unified exploration–synthesis objectives and continuous learning for coordination strategies.

Abstract

Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single-model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis toward the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a listwise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.
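
The abstract describes CLPO only at a high level: a decision-focused policy-gradient term combined with a listwise ranking loss over justifications. The exact objective is not given here, so the following is a minimal sketch under stated assumptions rather than the authors' formulation. It assumes a REINFORCE-style term on the Central Agent's choice among candidate answers and a ListNet-style cross-entropy between a softmax over rationale rewards and a softmax over model scores. All function names, the `temperature` parameter, and the weighting factor `lam` are hypothetical.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def decision_pg_loss(decision_logits, chosen, advantage):
    """Assumed decision term: -advantage * log pi(chosen candidate).

    Only the discrete decision receives the policy-gradient signal,
    so credit is not smeared over the rationale tokens.
    """
    return -advantage * log_softmax(decision_logits)[chosen]


def listwise_rank_loss(rationale_scores, rewards, temperature=1.0):
    """Assumed ranking term: ListNet-style cross-entropy.

    The target distribution is softmax(rewards / temperature); the model
    distribution is softmax(rationale_scores).
    """
    target_logp = log_softmax([r / temperature for r in rewards])
    target = [math.exp(v) for v in target_logp]
    model_logp = log_softmax(rationale_scores)
    return -sum(t * lp for t, lp in zip(target, model_logp))


def combined_loss(decision_logits, chosen, advantage,
                  rationale_scores, rewards, lam=0.5):
    """Hypothetical combination: decision term plus lam times ranking term."""
    pg = decision_pg_loss(decision_logits, chosen, advantage)
    rank = listwise_rank_loss(rationale_scores, rewards)
    return pg + lam * rank


if __name__ == "__main__":
    # Four candidate answers from the Execution Agents; the Central Agent
    # picked candidate 1 and received a positive advantage.
    loss = combined_loss(
        decision_logits=[0.2, 1.5, -0.3, 0.0],
        chosen=1,
        advantage=1.0,
        rationale_scores=[0.1, 0.9, 0.4],
        rewards=[0.0, 1.0, 0.5],
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the ranking term compares rationales against each other rather than scoring each in isolation, which is one plausible reading of the "comparative supervision" the abstract mentions.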
