
Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning

Peihong Yu, Manav Mishra, Alec Koppel, Carl Busart, Priya Narayan, Dinesh Manocha, Amrit Bedi, Pratap Tokekar

TL;DR

PegMARL tackles efficient exploration in multi-agent reinforcement learning by leveraging personalized expert demonstrations for each agent type, addressing the impracticality of collecting joint demonstrations. It formalizes an occupancy-measure objective that combines long-term return with a Jensen-Shannon divergence term between local occupancies and personalized expert occupancies, expressed as $\frac{1}{N}\sum_{i=1}^N \left( \langle \lambda^{\pi_\theta}_i, r \rangle - \eta\, \mathbb{D}_{JS}\left( \lambda^{\pi_\theta}_i \,\|\, \lambda^{\pi_{E_i}} \right) \right)$, and replaces the divergence with adversarial discriminators $D_{\phi_i}$ and $D_{\overline{\phi_i}}$ to produce reshaped rewards $\hat{r}_i = r - \eta\, D_{\overline{\phi_i}}(s_i,a_i,s_i') \log(1 - D_{\phi_i}(s_i,a_i))$. This approach enables leveraging personalized demonstrations for cooperative MARL, while still benefiting from joint demonstrations when available, and demonstrates strong performance across discrete and continuous domains, robustness to suboptimal demonstrations, and scalability with the number of agents. The method offers practical flexibility by accommodating demonstrations from non-co-trained policies and integrates seamlessly with policy-gradient methods such as MAPPO. Overall, PegMARL advances data-efficient MARL by combining per-agent demonstration signals with learned transition-level guidance to foster cooperation in heterogeneous teams.
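The reshaped reward in the TL;DR can be illustrated with a small NumPy sketch. This is only an illustration of the formula as written above, not the authors' implementation: the function name `reshaped_reward`, the default `eta`, and the clipping constant `eps` are hypothetical, and the two discriminator outputs are assumed to be probabilities in [0, 1] that have already been computed for each transition.

```python
import numpy as np


def reshaped_reward(r, d_demo, d_dyn, eta=0.5, eps=1e-8):
    """Evaluate r_hat_i = r - eta * D_bar(s_i, a_i, s_i') * log(1 - D(s_i, a_i)).

    r      : environment reward per transition
    d_demo : output of D_{phi_i}(s_i, a_i), assumed to lie in [0, 1]
    d_dyn  : output of D_{bar phi_i}(s_i, a_i, s_i'), assumed to lie in [0, 1]
    eta    : weight of the demonstration term (hypothetical default)
    eps    : keeps log(1 - d_demo) finite when d_demo is 1 (hypothetical)
    """
    r = np.asarray(r, dtype=float)
    d_demo = np.clip(np.asarray(d_demo, dtype=float), 0.0, 1.0 - eps)
    d_dyn = np.asarray(d_dyn, dtype=float)
    # log(1 - d_demo) <= 0, so the subtracted term is a non-negative bonus
    # that the transition-level discriminator output scales up or down.
    return r - eta * d_dyn * np.log(1.0 - d_demo)


# Three transitions with the same environment reward.
r_hat = reshaped_reward(
    r=[1.0, 1.0, 1.0],
    d_demo=[0.0, 0.5, 0.5],
    d_dyn=[1.0, 1.0, 0.0],
    eta=1.0,
)
```

Under these assumptions, the first transition keeps its original reward because $\log(1 - 0) = 0$, the second gains a bonus of $\log 2$, and the third keeps its original reward because the transition-level discriminator outputs zero and so suppresses the bonus.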

Abstract

Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements, thus naively imitating them will not achieve cooperation due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of individual agent behavior with demonstrations, and the second regulates incentives based on whether the behaviors lead to the desired outcome. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The experimental results demonstrate that PegMARL outperforms state-of-the-art MARL algorithms in solving coordinated tasks, achieving strong performance even when provided with suboptimal personalized demonstrations. We also showcase PegMARL's capability of leveraging joint demonstrations in the StarCraft scenario and converging effectively even with demonstrations from non-co-trained policies.

Paper Structure (22 sections, 9 equations, 21 figures, 5 tables, 1 algorithm)

This paper contains 22 sections, 9 equations, 21 figures, 5 tables, 1 algorithm.

Figures (21)

  • Figure 1: Joint demonstrations are costly to collect but offer rich information on collaborative behaviors. Personalized demonstrations are easier to collect, but solely focus on individual agent goals, so they lack cooperative elements.
  • Figure 2: An example of utilizing personalized demonstrations to learn cooperative multi-agent policies. To learn successful cooperation, the agents are required not only to imitate the demonstrations to achieve personal goals but also to learn how to avoid conflicts and collaborate. We visualize the state visitation frequency of the personalized demonstrations and the joint policies learned by our algorithm and MAPPO, where a darker color means a higher value. We observe that the demonstrations guide the agents in exploring the state space more efficiently than in MAPPO.
  • Figure 3: When joint demonstrations are sampled from co-trained policies, the agents' behaviors exhibit compatibility. In contrast, personalized demonstrations solely focus on how each agent achieves its individual goal and lack cooperative elements, potentially leading to conflicts.
  • Figure 4: A motivating example illustrating the imitation of personalized demonstrations in a multi-agent environment. The primary technical challenge lies in the discrepancy between the transition dynamics in the personalized MDP and the local transition dynamics for each agent in the multi-agent environment.
  • Figure 5: The lava scenario: the agents are homogeneous, aiming to reach corresponding diagonal positions without entering the lava. The episode ends if any agent steps into the lava.
  • ...and 16 more figures