Joint Intrinsic Motivation for Coordinated Exploration in Multi-Agent Deep Reinforcement Learning
Maxime Toquebiau, Nicolas Bredeche, Faïz Benamar, Jae-Yun Jun
TL;DR
This work tackles sparse rewards in multi-agent deep reinforcement learning by addressing the need to explore coordinated behaviors beyond local observations. It introduces Joint Intrinsic Motivation (JIM), a centralized novelty-based mechanism that evaluates joint observations to incentivize exploration of coordinated strategies under the CTDE paradigm. JIM combines life-long novelty inspired by NovelD with an episodic, ellipsoidal bonus from E3B into a double-timescale intrinsic reward, r_t^JIM, that promotes diverse joint trajectories while progressively focusing on the extrinsic task. Empirical results in synthetic relative-overgeneralization settings and continuous multi-agent tasks show that JIM enables reliable discovery of optimal coordinated policies and scales to more agents, outperforming QMIX and local-intrinsic-reward baselines. Overall, the work demonstrates the value of joint-observation-based intrinsic rewards for efficient, scalable coordination in sparse-reward MADRL scenarios.
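The double-timescale idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the two terms are combined by multiplication, that the life-long novelty values come from some external estimator (NovelD uses a prediction error for this), and that the embedding `phi_next` of the joint observation is already computed. The names `EpisodicEllipsoid`, `lifelong_term`, `joint_intrinsic_reward`, `ridge`, and `alpha` are all hypothetical.

```python
import numpy as np


class EpisodicEllipsoid:
    """E3B-style episodic bonus b = phi^T C^{-1} phi (assumed form).

    C starts at ridge * I at the beginning of every episode and grows by
    phi phi^T after each step, so directions already visited in the current
    episode earn a smaller bonus.
    """

    def __init__(self, dim, ridge=0.1):
        self.dim = dim
        self.ridge = ridge
        self.reset()

    def reset(self):
        # Call at the start of each episode: forget all episodic statistics.
        self.inv_cov = np.eye(self.dim) / self.ridge

    def bonus(self, phi):
        u = self.inv_cov @ phi
        b = float(phi @ u)
        # Sherman-Morrison update of C^{-1} after adding phi phi^T to C.
        self.inv_cov -= np.outer(u, u) / (1.0 + b)
        return b


def lifelong_term(novelty_next, novelty_curr, alpha=0.5):
    """NovelD-style term: reward only an increase in life-long novelty."""
    return max(novelty_next - alpha * novelty_curr, 0.0)


def joint_intrinsic_reward(novelty_curr, novelty_next, phi_next, ellipsoid, alpha=0.5):
    """Assumed multiplicative combination of the two timescales."""
    return lifelong_term(novelty_next, novelty_curr, alpha) * ellipsoid.bonus(phi_next)


# Joint observation: concatenation of the agents' local observations,
# used directly as the embedding here (an assumption made for brevity).
local_obs = [np.array([1.0]), np.array([0.0])]
phi = np.concatenate(local_obs)
ellipsoid = EpisodicEllipsoid(dim=phi.size, ridge=1.0)
reward = joint_intrinsic_reward(0.5, 1.0, phi, ellipsoid)
```

Because the novelty and the ellipsoid are both computed on the concatenated observation, the reward depends on the joint configuration of the agents rather than on any single agent's view. Revisiting the same joint observation within an episode shrinks the episodic factor, while the life-long factor fades as the novelty estimator becomes familiar with the region.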
Abstract
Multi-agent deep reinforcement learning (MADRL) problems often encounter the challenge of sparse rewards. This challenge becomes even more pronounced when coordination among agents is necessary. As performance depends not on one agent's behavior alone but on the joint behavior of multiple agents, finding an adequate solution becomes significantly harder. In this context, a group of agents can benefit from actively exploring different joint strategies in order to determine the most efficient one. In this paper, we propose an approach for rewarding strategies where agents collectively exhibit novel behaviors. We present JIM (Joint Intrinsic Motivation), a multi-agent intrinsic motivation method that follows the centralized learning with decentralized execution paradigm. JIM rewards joint trajectories based on a centralized measure of novelty designed to function in continuous environments. We demonstrate the strengths of this approach both in a synthetic environment designed to reveal shortcomings of state-of-the-art MADRL methods, and in simulated robotic tasks. Results show that joint exploration is crucial for solving tasks where the optimal strategy requires a high level of coordination.
