Joint Intrinsic Motivation for Coordinated Exploration in Multi-Agent Deep Reinforcement Learning

Maxime Toquebiau; Nicolas Bredeche; Faïz Benamar; Jae-Yun Jun

TL;DR

This work tackles sparse rewards in multi-agent deep reinforcement learning, where agents must explore coordinated behaviors that cannot be discovered from local observations alone. It introduces Joint Intrinsic Motivation (JIM), a centralized novelty-based mechanism that evaluates joint observations to incentivize exploration of coordinated strategies under the CTDE paradigm. JIM combines life-long novelty inspired by NovelD with an episodic elliptical bonus from E3B into a double-timescale intrinsic reward, $r_t^{\mathrm{JIM}}$, that promotes diverse joint trajectories while progressively focusing on the extrinsic task. Empirical results in synthetic relative-overgeneralization settings and continuous multi-agent tasks show that JIM enables reliable discovery of optimal coordinated policies and scales to more agents, outperforming QMIX and local-intrinsic-reward baselines. Overall, the work demonstrates the value of joint-observation-based intrinsic rewards for efficient, scalable coordination in sparse-reward MADRL scenarios.
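To make the double-timescale idea concrete, here is a minimal Python sketch of a joint intrinsic reward computed on concatenated observations. It is an illustration under stated assumptions, not the authors' implementation: the class names `RNDNovelty` and `EllipticalBonus`, the function `joint_intrinsic_reward`, the linear random-network predictor, the use of raw joint observations as features, the product of the two terms, and all hyperparameters (`alpha`, `ridge`, `lr`) are hypothetical choices made for this example.

```python
import numpy as np


class RNDNovelty:
    """Life-long novelty: prediction error against a fixed random linear map.

    A toy stand-in for random network distillation (assumption: linear maps
    instead of deep networks).
    """

    def __init__(self, obs_dim, feat_dim=16, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(obs_dim, feat_dim))  # fixed, never trained
        self.pred = np.zeros((obs_dim, feat_dim))           # trained predictor
        self.lr = lr

    def novelty(self, obs):
        err = obs @ self.pred - obs @ self.target
        return float(np.sum(err ** 2))

    def update(self, obs):
        # One gradient step on 0.5 * ||obs @ pred - obs @ target||^2.
        err = obs @ self.pred - obs @ self.target
        self.pred -= self.lr * np.outer(obs, err)


class EllipticalBonus:
    """Episodic bonus f^T C^{-1} f, with C reset at the start of each episode."""

    def __init__(self, dim, ridge=0.1):
        self.dim, self.ridge = dim, ridge
        self.reset()

    def reset(self):
        # C starts as ridge * I, so its inverse is I / ridge.
        self.cov_inv = np.eye(self.dim) / self.ridge

    def bonus(self, feat):
        b = float(feat @ self.cov_inv @ feat)
        # Sherman-Morrison rank-one update of the inverse after adding feat feat^T.
        u = self.cov_inv @ feat
        self.cov_inv -= np.outer(u, u) / (1.0 + b)
        return b


def joint_intrinsic_reward(rnd, ellip, joint_obs, next_joint_obs, alpha=0.5):
    """Combine a NovelD-style life-long term with an episodic elliptical term.

    Assumption: the two terms are simply multiplied; the paper may scale
    or combine them differently.
    """
    novel_d = max(rnd.novelty(next_joint_obs) - alpha * rnd.novelty(joint_obs), 0.0)
    episodic = ellip.bonus(next_joint_obs)
    rnd.update(next_joint_obs)
    return novel_d * episodic
```

In use, the joint observation would be the concatenation of all agents' local observations (for example `np.concatenate([obs_agent_1, obs_agent_2])`), `EllipticalBonus.reset()` would be called at each episode start, and the resulting scalar would be added to the shared extrinsic reward during centralized training. Agents would still act on their local observations only.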

Abstract

Multi-agent deep reinforcement learning (MADRL) problems often encounter the challenge of sparse rewards. This challenge becomes even more pronounced when coordination among agents is necessary. As performance depends not only on one agent's behavior but rather on the joint behavior of multiple agents, finding an adequate solution becomes significantly harder. In this context, a group of agents can benefit from actively exploring different joint strategies in order to determine the most efficient one. In this paper, we propose an approach for rewarding strategies where agents collectively exhibit novel behaviors. We present JIM (Joint Intrinsic Motivation), a multi-agent intrinsic motivation method that follows the centralized learning with decentralized execution paradigm. JIM rewards joint trajectories based on a centralized measure of novelty designed to function in continuous environments. We demonstrate the strengths of this approach both in a synthetic environment designed to reveal shortcomings of state-of-the-art MADRL methods, and in simulated robotic tasks. Results show that joint exploration is crucial for solving tasks where the optimal strategy requires a high level of coordination.

Paper Structure (30 sections, 12 equations, 13 figures, 2 tables)

Introduction
Related Works
Background
Dec-POMDP
Intrinsic rewards
Random Network Distillation (RND)
Novelty Difference (NovelD)
Exploration via Elliptical Episodic Bonuses (E3B)
Algorithm
The challenge of coordinated actions
Double-timescale Intrinsic Reward
The Joint Intrinsic Motivation algorithm
Implementation details
Experiments
Addressing relative overgeneralization
...and 15 more sections

Figures (13)

Figure 1: Two examples of relative overgeneralization: (a) payoff matrix of a social dilemma game, (b) heat-map of the reward function in the $\mathtt{rel\_overgen}$ environment for two agents, with $D=40$ and $\delta=30$. Details in Section \ref{sec6.1}.
Figure 2: Architecture for the Joint Intrinsic Motivation (JIM) algorithm. JIM has only one intrinsic motivation module for the whole multi-agent system, computing novelty of the joint observation $\mathbf{o}_t$. However, agents only use their local observation to choose their action.
Figure 3: Performance of variants of QMIX in the $\mathtt{rel\_overgen}$ environment, with three levels of difficulty. The top row shows heat maps of the reward function in each instance, where the difficulty is dictated by the width coefficient $\delta$ of the optimal reward spike (as defined in Eq. \ref{eq:RelOvergen}). Increasing $\delta$ leads to a smaller optimal reward spike. The bottom row shows the training performance of QMIX with no intrinsic reward (QMIX), local intrinsic motivation (QMIX+LIM), and joint intrinsic motivation (QMIX+JIM), with mean and standard deviation over 15 runs each. A slight decrease in the size of the optimal reward spike results in a considerable increase in the difficulty of the task.
Figure 4: Screenshots of our custom tasks in the MPE; agents are the small grey circles. (a) Cooperative box pushing scenario: agents must deliver the object (green circle in the middle) to the landmark in the bottom right corner. (b) Coordinated placement scenario: agents must navigate to position themselves on the colored circles representing landmarks.
Figure 5: Training curves of the three variants of QMIX in the cooperative box pushing task, with the mean and standard deviation across 11 runs each.
...and 8 more figures
