LERO: LLM-driven Evolutionary framework with Hybrid Rewards and Enhanced Observation for Multi-Agent Reinforcement Learning

Yuan Wei, Xiaohan Shan, Jianmin Li

TL;DR

The paper tackles the twin challenges of credit assignment and partial observability in multi-agent reinforcement learning (MARL) by introducing LERO, an LLM-driven evolutionary framework that jointly optimizes two modular components: a hybrid reward function (HRF) and an observation enhancement function (OEF). An outer evolutionary loop uses an LLM as the evolutionary operator, with a selector module ranking candidate HRFs and OEFs across MARL training runs to guide subsequent generations. The approach is algorithm-agnostic and evaluated on Cooperative Navigation tasks in the Multi-Agent Particle Environment (MPE) across MAPPO, VDN, and QMIX, showing superior performance and faster convergence relative to native baselines and ablated variants. The results demonstrate that LLM-informed design and evolutionary refinement can substantially improve coordination and training efficiency in MARL, suggesting a scalable pathway for integrating language-model reasoning into multi-agent learning systems.

Abstract

Multi-agent reinforcement learning (MARL) faces two critical bottlenecks distinct from single-agent RL: credit assignment in cooperative tasks and partial observability of environmental states. We propose LERO, a framework integrating large language models (LLMs) with evolutionary optimization to address these MARL-specific challenges. The solution centers on two LLM-generated components: a hybrid reward function that dynamically allocates individual credit through reward decomposition, and an observation enhancement function that augments partial observations with inferred environmental context. An evolutionary algorithm optimizes these components through iterative MARL training cycles, where top-performing candidates guide subsequent LLM generations. Evaluations in Multi-Agent Particle Environments (MPE) demonstrate LERO's superiority over baseline methods, with improved task performance and training efficiency.
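
The outer loop described in the abstract can be pictured with a minimal Python sketch. Everything here is an illustrative assumption and not the authors' implementation: `Candidate`, `evolve`, `make_stub_llm`, and `stub_train_and_score` are hypothetical names, the "LLM" is a random perturbation of a single number, and the "MARL training run" is a toy scoring function. Only the control flow (propose, train, rank, feed the top candidates back) is taken from the text.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    # Hypothetical container: source text of one HRF/OEF pair plus its score.
    hrf_source: str
    oef_source: str
    score: float = float("-inf")


def evolve(propose: Callable[[List[Candidate]], Candidate],
           train_and_score: Callable[[Candidate], float],
           generations: int = 3, population: int = 4,
           top_k: int = 2) -> Candidate:
    """Propose candidates, score each one, keep the top_k as feedback."""
    if generations < 1 or population < 1 or top_k < 1:
        raise ValueError("generations, population and top_k must be >= 1")
    elites: List[Candidate] = []
    for _ in range(generations):
        # The proposer sees the current elites, mirroring "top-performing
        # candidates guide subsequent LLM generations".
        batch = [propose(elites) for _ in range(population)]
        for cand in batch:
            cand.score = train_and_score(cand)  # stands in for a MARL run
        ranked = sorted(elites + batch, key=lambda c: c.score, reverse=True)
        elites = ranked[:top_k]
    return elites[0]


def make_stub_llm(seed: int = 0) -> Callable[[List[Candidate]], Candidate]:
    """Assumed stand-in for an LLM call: perturbs the best parent's alpha."""
    rng = random.Random(seed)

    def propose(elites: List[Candidate]) -> Candidate:
        if elites:
            parent = float(elites[0].hrf_source.split("=")[1])
            alpha = min(1.0, max(0.0, parent + rng.uniform(-0.2, 0.2)))
        else:
            alpha = rng.random()
        return Candidate(hrf_source=f"alpha={alpha:.3f}",
                         oef_source="identity")

    return propose


def stub_train_and_score(cand: Candidate) -> float:
    """Toy score in place of training: best when alpha is 0.5."""
    alpha = float(cand.hrf_source.split("=")[1])
    return -(alpha - 0.5) ** 2


if __name__ == "__main__":
    best = evolve(make_stub_llm(seed=0), stub_train_and_score, generations=5)
    print(best.hrf_source, best.score)
```

Because the elites are carried into each ranking step, the best score in this sketch never decreases from one generation to the next. In a real setting, `propose` would wrap a prompt built from the task description, environment code, and feedback on the elites, and `train_and_score` would launch a full training run.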

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The LERO framework follows an iterative process in which HRFs and OEFs are generated by LLMs based on task descriptions, environment code, and evolution descriptions. In each iteration, a selector module evaluates the performance of these HRFs and OEFs, allowing the most effective components to be selected for MARL training, ultimately enhancing agent adaptability and cooperation.
  • Figure 2: Comparison between LERO Framework and Baseline
  • Figure 3: Results of Hybrid-Reward-only and Observation-Enhanced-only Variants
  • Figure 4: Coverage rate of each iteration in Simple Reference
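
To make the two components named in the figure captions concrete, here is a hedged Python sketch of what a hybrid reward function and an observation enhancement function might look like. The page does not show the LLM-generated functions themselves, so the fixed-weight blend in `hybrid_reward` and the nearest-landmark feature in `enhance_observation` are assumptions chosen for illustration, as are both function names and all parameters.

```python
import math
from typing import List, Sequence, Tuple


def hybrid_reward(team_reward: float,
                  individual_rewards: Sequence[float],
                  weight: float = 0.5) -> List[float]:
    """Blend a shared team reward with per-agent terms.

    Assumption: a fixed convex combination. The paper describes the credit
    allocation as dynamic; this sketch only shows the general shape.
    """
    return [weight * r_i + (1.0 - weight) * team_reward
            for r_i in individual_rewards]


def enhance_observation(obs: Sequence[float],
                        landmark_offsets: Sequence[Tuple[float, float]]
                        ) -> List[float]:
    """Append one derived feature to a partial observation.

    Assumption: the added context is the distance to the nearest landmark,
    computed from relative (dx, dy) offsets the agent can already see.
    """
    nearest = min(math.hypot(dx, dy) for dx, dy in landmark_offsets)
    return list(obs) + [nearest]


if __name__ == "__main__":
    print(hybrid_reward(1.0, [0.0, 2.0], weight=0.5))
    print(enhance_observation([0.1, 0.2], [(3.0, 4.0), (6.0, 8.0)]))
```

Keeping both functions pure (no environment state, no side effects) is a design choice of this sketch: it lets an outer loop swap candidate implementations in and out without touching the underlying MARL algorithm.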