PIMAEX: Multi-Agent Exploration through Peer Incentivization

Michael Kölle, Johannes Tochtermann, Julian Schönberger, Gerhard Stenzel, Philipp Altmann, Claudia Linnhoff-Popien

TL;DR

This work tackles exploration in multi-agent reinforcement learning under sparse and deceptive rewards by introducing PIMAEX, a peer-incentivized reward that blends intrinsic curiosity with social influence. The framework is extended with PIMAEX-Communication, enabling a discrete communication channel and counterfactual influence analysis to assign credit for peer-driven exploration. The reward comprises three terms, $\alpha$, $\beta$, and $\gamma$, allowing influence-based signals to be combined with intrinsic rewards; experiments in the Consume/Explore environment show that single-term PIMAEX with the $\beta$ component yields the strongest performance and stability, outperforming vanilla PPO and PPO+RND baselines. Overall, PIMAEX demonstrates a flexible, scalable approach to improving MARL exploration via peer influence, with potential applicability across a range of actor-critic methods and environments.

Abstract

While exploration in single-agent reinforcement learning has been studied extensively in recent years, considerably less work has focused on its counterpart in multi-agent reinforcement learning. To address this issue, this work proposes a peer-incentivized reward function inspired by previous research on intrinsic curiosity and influence-based rewards. The PIMAEX reward, short for Peer-Incentivized Multi-Agent Exploration, aims to improve exploration in the multi-agent setting by encouraging agents to exert influence over each other to increase the likelihood of encountering novel states. We evaluate the PIMAEX reward in conjunction with PIMAEX-Communication, a multi-agent training algorithm that employs a communication channel for agents to influence one another. The evaluation is conducted in the Consume/Explore environment, a partially observable environment with deceptive rewards, specifically designed to challenge the exploration vs. exploitation dilemma and the credit-assignment problem. The results empirically demonstrate that agents using the PIMAEX reward with PIMAEX-Communication outperform those that do not.

Paper Structure (18 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Overall return per episode, i.e., sum of all individual agent returns, of the best training run of each agent category, PPO, PPO+RND, PPO+RND+PIMAEX $\alpha/\beta/\gamma$, from Actor and Evaluator processes run during training. While agents in the Actor processes act according to their stochastic policies, agents in the Evaluator process act according to their greedy policies, i.e., selecting the actions with the highest probability. Results are averaged over three seeds.
  • Figure 2: Per-episode overall return, i.e., sum of all individual agent returns, and per-agent returns, from evaluation runs of the best trained models of each agent category, PPO, PPO+RND, PPO+RND+PIMAEX $\alpha/\beta/\gamma$. Evaluation is run for 200 episodes per seed (= 600 episodes) and results are averaged over all 600 episodes per model. Plots show mean and standard deviation.
  • Figure 3: Final state space coverage of the best training run of each agent category, PPO, PPO+RND, PPO+RND+PIMAEX $\alpha/\beta/\gamma$, for per-agent state space and (global) exploration state space. Results are averaged over three seeds, plots show mean and standard deviation.
  • Figure 4: Per-episode exploration and local agent state space coverage, from evaluation runs of the best trained models of each agent category, PPO, PPO+RND, PPO+RND+PIMAEX $\alpha/\beta/\gamma$. Evaluation is run for 200 episodes per seed (= 600 episodes) and results are averaged over all 600 episodes per model. Plots show mean and standard deviation.
  • Figure 5: Per-episode percentage of consume and explore actions per agent, from evaluation runs of the best trained models of each agent category, PPO, PPO+RND, PPO+RND+PIMAEX $\alpha/\beta/\gamma$. Evaluation is run for 200 episodes per seed (= 600 episodes) and results are averaged over all 600 episodes per model. Plots show mean and standard deviation.
  • ...and 2 more figures
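
The captions above distinguish stochastic action selection (Actor processes) from greedy action selection (Evaluator process), define the overall return as the sum of per-agent returns, and report means and standard deviations over episodes pooled across seeds. The following is a minimal sketch of those three ideas, not the authors' code: the function names `select_action`, `overall_return`, and `summarize` are hypothetical, and the policy is assumed to be a categorical distribution over discrete actions.

```python
import numpy as np


def select_action(probs, greedy, rng=None):
    """Pick an action index from a categorical policy (assumed discrete actions).

    greedy=True  -> Evaluator-style: the action with the highest probability.
    greedy=False -> Actor-style: a sample drawn from the policy distribution.
    """
    probs = np.asarray(probs, dtype=float)
    if greedy:
        return int(np.argmax(probs))
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))


def overall_return(per_agent_returns):
    """Overall return of one episode: the sum of all individual agent returns."""
    return float(np.sum(per_agent_returns))


def summarize(per_seed_episode_returns):
    """Pool the episodes of all seeds, then report count, mean, and std."""
    pooled = np.concatenate(
        [np.asarray(r, dtype=float) for r in per_seed_episode_returns]
    )
    return pooled.size, float(pooled.mean()), float(pooled.std())
```

With three seeds of 200 evaluation episodes each, `summarize` pools 600 values, matching the "200 episodes per seed (= 600 episodes)" setup described in the captions. Pooling before averaging is one reading of "averaged over all 600 episodes"; averaging per seed first would give the same mean here but a different standard deviation.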