PIMAEX: Multi-Agent Exploration through Peer Incentivization
Michael Kölle, Johannes Tochtermann, Julian Schönberger, Gerhard Stenzel, Philipp Altmann, Claudia Linnhoff-Popien
TL;DR
This work tackles exploration in multi-agent reinforcement learning (MARL) under sparse and deceptive rewards by introducing PIMAEX, a peer-incentivized reward that blends intrinsic curiosity with social influence. The accompanying training algorithm, PIMAEX-Communication, gives agents a discrete communication channel and uses counterfactual influence analysis to assign credit for peer-driven exploration. The reward comprises three terms, $\alpha$, $\beta$, and $\gamma$, which allow influence-based signals to be combined with intrinsic rewards. In experiments in the Consume/Explore environment, single-term PIMAEX with the $\beta$ component yields the strongest performance and stability, outperforming vanilla PPO and PPO+RND baselines. Overall, the results suggest that PIMAEX offers a flexible, scalable approach to improving MARL exploration via peer influence, with potential applicability across a range of actor-critic methods and environments.
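To make the idea of counterfactual influence over a discrete communication channel concrete, the following is a minimal sketch of a generic influence measure from the social-influence literature, not the exact PIMAEX terms. It assumes the influencer's message distribution and the peer's action distribution for every possible message are available; the function name `counterfactual_influence` and all argument names are hypothetical. Influence is taken as the KL divergence between the peer's policy given the message actually sent and the peer's marginal policy averaged over counterfactual messages.

```python
import math
from typing import Sequence


def counterfactual_influence(
    message_probs: Sequence[float],
    peer_policy_given_message: Sequence[Sequence[float]],
    sent_message: int,
    eps: float = 1e-12,
) -> float:
    """Generic influence measure (an assumption, not the paper's definition).

    Returns KL( p(a | sent_message) || sum_m p(m) * p(a | m) ), where
    p(m) is the influencer's message distribution and p(a | m) is the
    peer's action distribution after receiving message m.
    """
    n_actions = len(peer_policy_given_message[0])

    # Marginal peer policy: average the conditional policies over all
    # counterfactual messages, weighted by how likely each message was.
    marginal = [
        sum(p_m * cond[a] for p_m, cond in zip(message_probs, peer_policy_given_message))
        for a in range(n_actions)
    ]

    # Peer policy under the message that was actually sent.
    conditional = peer_policy_given_message[sent_message]

    # KL divergence; terms with zero conditional probability contribute 0.
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(conditional, marginal)
        if p > 0.0
    )


# A peer that ignores the channel yields (near) zero influence.
ignored = counterfactual_influence([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]], sent_message=0)

# A peer whose action is fully determined by the message yields high influence.
followed = counterfactual_influence([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], sent_message=0)
```

In this sketch a message only earns credit when it measurably shifts the peer's behaviour. How such a signal is weighted and combined with intrinsic rewards is governed by the $\alpha$, $\beta$, and $\gamma$ terms of the paper, which are not reproduced here.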
Abstract
While exploration in single-agent reinforcement learning has been studied extensively in recent years, considerably less work has focused on its counterpart in multi-agent reinforcement learning. To address this gap, this work proposes a peer-incentivized reward function inspired by previous research on intrinsic curiosity and influence-based rewards. The \textit{PIMAEX} reward, short for Peer-Incentivized Multi-Agent Exploration, aims to improve exploration in the multi-agent setting by encouraging agents to exert influence over each other to increase the likelihood of encountering novel states. We evaluate the \textit{PIMAEX} reward in conjunction with \textit{PIMAEX-Communication}, a multi-agent training algorithm that employs a communication channel for agents to influence one another. The evaluation is conducted in the \textit{Consume/Explore} environment, a partially observable environment with deceptive rewards, specifically designed to pose both the exploration vs.\ exploitation dilemma and the credit-assignment problem. The results empirically demonstrate that agents using the \textit{PIMAEX} reward with \textit{PIMAEX-Communication} outperform those that do not.
