MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

Boyuan Wu

TL;DR

MAESTRO reframes cooperative MARL training by using a large language model as an offline training architect to generate semantic curricula and reward templates. The approach couples an adaptive, domain-specific curriculum with template-based LLM reward shaping and prior-policy regularization on top of MADDPG, improving stability and practical traffic metrics in a 16-intersection urban network. Across ablations, curriculum design emerges as the dominant lever, while constrained, template-based rewards yield robust, risk-adjusted gains with modest episode-return improvements. The work demonstrates that LLMs can serve as high-level trainers for MARL, enabling scalable, low-latency deployment while leveraging rich semantic priors for more robust learning in non-stationary environments.
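
As an illustration of what a constrained, template-based reward could look like, here is a minimal sketch. The template form, the parameter names (`queue_weight`, `wait_weight`), the bounds, and the observation fields are all hypothetical and are not taken from the paper. The only point is that an LLM would propose numbers for a fixed template, and those numbers are clipped to a bounded range before the reward is used in training.

```python
from dataclasses import dataclass

# Hypothetical bounds on template parameters (an assumption, not from the paper).
WEIGHT_BOUNDS = (0.0, 1.0)


def clip(value: float, low: float, high: float) -> float:
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class RewardParams:
    """Parameters of a fixed reward template (names are illustrative)."""
    queue_weight: float
    wait_weight: float

    @classmethod
    def from_proposal(cls, proposal: dict) -> "RewardParams":
        # Clip every proposed weight so an out-of-range suggestion
        # cannot destabilize training.
        low, high = WEIGHT_BOUNDS
        return cls(
            queue_weight=clip(float(proposal.get("queue_weight", 0.5)), low, high),
            wait_weight=clip(float(proposal.get("wait_weight", 0.5)), low, high),
        )


def shaped_reward(base_reward: float, queue_length: float,
                  mean_wait: float, params: RewardParams) -> float:
    """Base reward minus a weighted congestion penalty."""
    penalty = params.queue_weight * queue_length + params.wait_weight * mean_wait
    return base_reward - penalty


# A proposal with an out-of-range weight: 2.5 is clipped to 1.0.
params = RewardParams.from_proposal({"queue_weight": 2.5, "wait_weight": 0.1})
reward = shaped_reward(base_reward=1.0, queue_length=4.0, mean_wait=10.0, params=params)
```

Because the template is fixed and only its parameters vary, the language model is needed only while training is being configured, which is consistent with the claim that deployment latency is unaffected.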

Abstract

Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves a 4.0% higher mean return (163.26 vs. 156.93) and 2.2× better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
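
The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only the four values reported in the abstract; the variable and function names are illustrative, and nothing here reproduces how the paper itself computes its Sharpe values.

```python
# Values reported in the abstract (full system vs. curriculum baseline).
full_return, baseline_return = 163.26, 156.93
full_sharpe, baseline_sharpe = 1.53, 0.70


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, as a fraction of old."""
    return (new - old) / old


return_gain_pct = 100 * relative_gain(full_return, baseline_return)
sharpe_multiple = full_sharpe / baseline_sharpe

print(f"Mean return gain: {return_gain_pct:.1f}%")     # Mean return gain: 4.0%
print(f"Sharpe multiple:  {sharpe_multiple:.1f}x")     # Sharpe multiple:  2.2x
```

The return improvement is a relative gain of about 4.0%, while the Sharpe values differ by a factor of roughly 2.2.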

Paper Structure

This paper contains 75 sections, 7 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: MAESTRO workflow. The LLM (Architect) generates curriculum contexts and parameterizes executable templates for reward shaping and prior policy logits. MARL agents (Learners) train under this shaped environment using MADDPG. Performance feedback drives curriculum updates and, in A7, periodic regeneration of reward and policy templates. LLM calls occur only during training, not deployment.
  • Figure 2: Conceptual foundation of MAESTRO. The LLM (Architect) generates task-specific components (curriculum contexts, reward parameters, and prior policy parameters) that are injected into the MARL training loop. The deployed controller is an unmodified MADDPG policy; LLM inference is restricted to training.
  • Figure 3: Training performance over 200 episodes. Lines show mean episode return; bands show 95% confidence intervals. A7 exhibits the narrowest band, indicating high stability across seeds.
  • Figure 4: Normalized comparison across five metrics (0 to 1 scale). A7 balances final return, stability, learning speed, sample efficiency, and robustness. A8 dominates in peak return but is less stable; A2 is consistently weaker.
  • Figure 5: Mean returns and standard deviations. A8 yields the largest raw gain; A7 offers smaller gains with much tighter error bars.
  • ...and 3 more figures
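
For readers who want to reproduce bands like those described for Figure 3, the following is a minimal sketch of a per-episode mean and 95% confidence interval across seeds. It assumes four seeds and a Student-t interval; the paper does not state in this section which interval construction it uses, and the return curves below are made-up numbers for illustration only.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 3 degrees of freedom (4 seeds).
T_CRIT_DF3 = 3.182


def mean_and_ci95(returns_by_seed):
    """Return per-episode means and 95% CI half-widths across four seeds.

    returns_by_seed: list of four equal-length lists, one return curve per seed.
    """
    n = len(returns_by_seed)
    if n != 4:
        raise ValueError("this sketch hard-codes the t value for exactly 4 seeds")
    means, half_widths = [], []
    # zip(*...) groups the returns of all seeds for the same episode.
    for episode_returns in zip(*returns_by_seed):
        standard_error = statistics.stdev(episode_returns) / math.sqrt(n)
        means.append(statistics.fmean(episode_returns))
        half_widths.append(T_CRIT_DF3 * standard_error)
    return means, half_widths


# Made-up curves for four seeds over three episodes (not data from the paper).
curves = [
    [100.0, 120.0, 150.0],
    [102.0, 118.0, 152.0],
    [98.0, 122.0, 148.0],
    [100.0, 120.0, 150.0],
]
means, half_widths = mean_and_ci95(curves)
```

A narrower band at a given episode means the seeds agree more closely there, which is the sense in which the Figure 3 caption reads a narrow band as stability across seeds.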