Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions

Weifan Long, Wen Wen, Peng Zhai, Lihua Zhang

TL;DR

The paper proves that an approximately optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role-based policy, and shows empirically that RP is robust and adaptable in complex environments.

Abstract

The zero-shot coordination problem in multi-agent reinforcement learning (MARL), which requires agents to adapt to unseen agents, has attracted increasing attention. Traditional approaches often rely on the Self-Play (SP) framework to generate a diverse set of policies in a policy pool, which serves to improve the generalization capability of the final agent. However, these frameworks may struggle to capture the full spectrum of potential strategies, especially in real-world scenarios that demand that agents balance cooperation with competition. In such settings, agents need strategies that can adapt to varying and often conflicting goals. Drawing inspiration from Social Value Orientation (SVO), in which individuals maintain stable value orientations during interactions with others, we propose a novel framework called Role Play (RP). RP employs role embeddings to transform the challenge of policy diversity into a more manageable diversity of roles. It trains a common policy with role embedding observations and employs a role predictor to estimate the joint role embeddings of other agents, helping the learning agent adapt to its assigned role. We theoretically prove that an approximate optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role-based policy. Experimental results in both cooperative (Overcooked) and mixed-motive (Harvest, CleanUp) games reveal that RP consistently outperforms strong baselines when interacting with unseen agents, highlighting its robustness and adaptability in complex environments.
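
The abstract does not state how an SVO-based role affects an agent's objective. As a rough illustration only, the sketch below uses the angle formulation common in the SVO literature, where an angle theta trades off an agent's own reward against the mean reward of the others. The function names `svo_reward` and `role_embedding`, the (cos, sin) encoding of a role, and the use of the mean over other agents are assumptions made for this example, not details confirmed by the paper.

```python
import math


def svo_reward(own_reward, others_rewards, theta_deg):
    """Blend own reward with the mean reward of others by an SVO angle.

    Assumed formulation (not taken from the paper):
        shaped = cos(theta) * own + sin(theta) * mean(others)
    theta = 0 deg is purely selfish, 45 deg is prosocial, 90 deg is altruistic.
    """
    theta = math.radians(theta_deg)
    if others_rewards:
        mean_others = sum(others_rewards) / len(others_rewards)
    else:
        mean_others = 0.0  # no other agents: only the own-reward term remains
    return math.cos(theta) * own_reward + math.sin(theta) * mean_others


def role_embedding(theta_deg):
    """Hypothetical role embedding: the SVO angle as a point on the unit circle."""
    theta = math.radians(theta_deg)
    return (math.cos(theta), math.sin(theta))


if __name__ == "__main__":
    # The same environment step yields different shaped rewards under different roles.
    for angle in (0, 45, 90):
        shaped = svo_reward(1.0, [3.0, 5.0], angle)
        print(angle, role_embedding(angle), round(shaped, 3))
```

Under this assumption, a single shared policy could be conditioned on the embedding while being trained on the correspondingly shaped reward, which is one way to read "diversity of roles" in place of a diverse policy pool.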

Paper Structure

This paper contains 19 sections, 2 theorems, 18 equations, 10 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

For a finite MDP with $T$ time steps and a specific role policy $\pi(z)$, if any random policy $\pi^\prime$ is $\epsilon$-close to the role policy $\pi(z^\prime)$, then we have

Figures (10)

  • Figure 1: A graphical representation of the SVO framework.
  • Figure 2: Role play framework: Different agents share the same policy network but with different role embedding inputs. The role policy prediction network is trained to predict the role embeddings of other agents. The learning agent optimizes its policy based on the predicted roles.
  • Figure 3: $RL^2$ based RP
  • Figure 4: Layouts in Overcooked.
  • Figure 5: Zero-shot evaluation results with script agents on Asymmetric Advantages and Cramped Room.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Definition 4.1
  • Theorem 4.1
  • proof
  • Theorem 4.1
  • proof