
Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

Dogan Urgun, Gokhan Gungor

Abstract

Designing effective auxiliary rewards for cooperative multi-agent systems remains a challenging task. Misaligned incentives risk inducing suboptimal coordination, especially when sparse task feedback fails to provide sufficient grounding. This study introduces an automated reward design framework that leverages large language models to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and evaluates their efficacy by training policies from scratch under a fixed computational budget. Selection across generations depends exclusively on the sparse task return. The framework is evaluated across four distinct Overcooked-AI layouts characterized by varied corridor congestion, handoff dependencies, and structural asymmetries. Iterative search generations consistently yield superior task returns and delivery counts, with the most pronounced gains occurring in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components indicates increased interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the search for objective-grounded reward programs can mitigate the burden of manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
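The abstract describes a generation-based loop: an LLM proposes executable reward programs, a formal validity envelope filters them, each surviving candidate is trained from scratch under a fixed computational budget, and promotion across generations depends only on the sparse task return. The sketch below illustrates that loop under stated assumptions; the callables `propose`, `is_valid`, and `train_and_eval`, as well as the default generation and candidate counts, are hypothetical placeholders and not the authors' implementation.

```python
from typing import Callable, Optional, Tuple

def reward_search(
    propose: Callable[[dict], str],          # hypothetical LLM call: context -> reward program source
    is_valid: Callable[[str], bool],         # hypothetical formal validity-envelope check
    train_and_eval: Callable[[str], float],  # hypothetical: train MAPPO from scratch, return sparse return J
    n_generations: int = 2,
    n_candidates: int = 4,
) -> Tuple[Optional[str], float]:
    """Generation-based search over reward programs, promoted by sparse task return only."""
    best_program: Optional[str] = None
    best_return = float("-inf")
    context: dict = {"history": [], "promoted": None}

    for _ in range(n_generations):
        for _ in range(n_candidates):
            program = propose(context)            # LLM writes an executable reward program
            if not is_valid(program):             # reject candidates outside the validity envelope
                continue
            j_sparse = train_and_eval(program)    # fixed compute budget per candidate
            context["history"].append((program, j_sparse))
            if j_sparse > best_return:
                best_program, best_return = program, j_sparse
        context["promoted"] = best_program        # condition the next generation on the promoted candidate
    return best_program, best_return
```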

Figures (6)

  • Figure 1: Overview of the proposed autonomous reward search framework. The system establishes a closed-loop optimization process where an LLM-based reward engineer agent iteratively refines reward candidates based on MAPPO evaluations and diagnostic feedback, grounded in task specifications and archive data.
  • Figure 2: Overview of the CTDE paradigm for multi-agent reinforcement learning. (a) Centralized training: a shared critic ($\mathbf{V}$) utilizes global state information and joint observation buffers to guide policy updates via training feedback. (b) Decentralized execution: individual actors ($\pi_i$) rely exclusively on local observations for action selection, ensuring scalability in partially observable environments. A minimal structural sketch of this setup follows the figure list.
  • Figure 3: Overcooked-AI layouts and coordination challenges. Each environment isolates specific facets of multi-agent cooperation: (a) Cramped Room evaluates spatial efficiency and collision avoidance in shared workspaces. (b) Forced Coordination necessitates strict functional specialization and inter-agent hand-offs. (c) Coordination Ring tests movement synchronization to prevent bottlenecks in circular corridors. (d) Asymmetric Advantages introduces resource-specific disparities, requiring strategic role delegation based on proximity to dispensers.
  • Figure 4: Learning curves of evaluation sparse return $J$. Performance comparison between the MAPPO baseline and the selected candidates from the first and second generations across four layouts: (a) Cramped Room, (b) Forced Coordination, (c) Coordination Ring, and (d) Asymmetric Advantages. Shaded regions indicate variability across evaluation episodes.
  • Figure 5: Candidate promotion diagram. Nodes summarize evaluated candidates and objective scores, and edges indicate the promotion path used to condition subsequent generations.
  • ...and 1 more figure
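Figure 2 summarizes the CTDE paradigm: a centralized critic conditioned on the global state during training, and decentralized actors that select actions from local observations at execution time. The minimal sketch below shows the corresponding network structure under an assumed standard PyTorch setup; the `Actor` and `CentralCritic` modules and their sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor pi_i: maps a local observation to action logits."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, local_obs: torch.Tensor) -> torch.Tensor:
        return self.net(local_obs)  # logits over the agent's own actions

class CentralCritic(nn.Module):
    """Centralized critic V: conditioned on the global state during training only."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, global_state: torch.Tensor) -> torch.Tensor:
        return self.net(global_state)  # value estimate used only for policy updates
```

At execution time only the actors are needed, so each agent acts from its own observation, matching panel (b) of Figure 2.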