LangMARL: Natural Language Multi-Agent Reinforcement Learning

Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, Hua Wei

Abstract

Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.

Paper Structure

This paper contains 39 sections, 8 equations, 12 figures, 2 tables, 2 algorithms.

Figures (12)

  • Figure 1: Challenges in multi-agent credit assignment. Global evaluation fails to pinpoint individual contributions, leading to ambiguous reflections. LangMARL addresses this by decomposing team performance into agent-specific credits.
  • Figure 2: An Easy-to-Use Toolkit for LangMARL. LangMARL mirrors the syntax and abstractions of classical MARL libraries (e.g., TorchRL) while redefining core components in natural language space, making LLM-based multi-agent optimization as straightforward to implement as a standard deep RL pipeline.
  • Figure 3: The LangMARL System Pipeline. The framework follows a CTDE paradigm: (i) Language Policy Actors execute decentralized actions, (ii) a Centralized Language Critic assigns trajectory-level causal credits, and (iii) the Language Policy Optimizer updates policies in natural language.
  • Figure 4: Impact of credit assignment on the convergence quality of LangMARL. The learning curves across five benchmark tasks demonstrate that credit assignment is pivotal for efficient policy optimization and superior final performance. Without this mechanism (green lines), the models exhibit sub-optimal learning rates and significant instability, particularly in complex reasoning and multi-agent coordination scenarios.
  • Figure 5: Emergent role specialization in LangMARL. Top: In the initial symmetric setup, both agents operate as generic problem solvers without predefined role differentiation. Bottom: After iterative trajectory-level credit assignment and language-based policy optimization, agents self-organize into complementary roles; in the coding scenario, for example, Agent 1 specializes in structured implementation and Agent 2 in critical evaluation and refinement. This division of labor is not explicitly specified in the prompt but emerges through centralized language credits.
  • ...and 7 more figures
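The CTDE loop described in Figure 3 (decentralized Language Policy Actors, a Centralized Language Critic, and a Language Policy Optimizer) can be sketched as follows. This is a minimal illustrative stub, not the LangMARL toolkit's actual API: the class and function names (`LanguageActor`, `centralized_language_critic`, `language_policy_optimizer`) and the string-based "policies" are assumptions standing in for LLM calls.

```python
# Hypothetical sketch of one LangMARL-style training iteration.
# All names here are illustrative; the real framework would query an
# LLM wherever a string is manipulated below.
from dataclasses import dataclass


@dataclass
class LanguageActor:
    """Decentralized actor whose 'policy' is a natural-language instruction."""
    name: str
    policy_text: str

    def act(self, observation: str) -> str:
        # In LangMARL the action would come from an LLM conditioned on
        # policy_text; here we just record the pairing for illustration.
        return f"{self.name} follows [{self.policy_text}] given '{observation}'"


def centralized_language_critic(trajectory: dict, reward: float) -> dict:
    """Assign per-agent textual credit from the joint trajectory (stub).

    A real critic would decompose the team outcome into causal,
    agent-specific critiques rather than a uniform verdict.
    """
    verdict = "helped" if reward > 0 else "hindered"
    return {name: f"{name} {verdict} the team outcome" for name in trajectory}


def language_policy_optimizer(actor: LanguageActor, credit: str) -> None:
    """A 'gradient step' in language space: fold the critique into the policy."""
    actor.policy_text += f"; revise per feedback: {credit}"


# One iteration over two initially symmetric agents (cf. Figure 5, top).
actors = [LanguageActor("agent_1", "solve the task"),
          LanguageActor("agent_2", "solve the task")]
trajectory = {a.name: [a.act("shared observation")] for a in actors}
credits = centralized_language_critic(trajectory, reward=1.0)
for a in actors:
    language_policy_optimizer(a, credits[a.name])
```

Under this reading, role specialization as in Figure 5 would emerge once the critic's per-agent credits diverge, so that successive language-space updates push each `policy_text` in a different direction.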