Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks

Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, Shimon Whiteson

TL;DR

DDRQN tackles the problem of learning communication protocols among multiple agents operating under partial observability. It introduces three architectural innovations—last-action inputs, inter-agent weight sharing, and disabling experience replay—to enable centralized learning of decentralized policies, producing a shared Q-function over private histories and agent IDs. Empirical results on Hats and Switch riddles show DDRQN learns effective coordination and emergent communication, outperforming baselines and revealing interpretable strategies; ablation studies confirm each component's critical role. The work demonstrates for the first time that deep reinforcement learning can autonomously discover communication protocols in multi-agent settings, with implications for scalable coordination in real-world, partially observable domains.

Abstract

We propose deep distributed recurrent Q-networks (DDRQN), which enable teams of agents to learn to solve communication-based coordination tasks. In these tasks, the agents are not given any pre-designed communication protocol. Therefore, in order to successfully communicate, they must first automatically develop and agree upon their own communication protocol. We present empirical results on two multi-agent learning problems based on well-known riddles, demonstrating that DDRQN can successfully solve such tasks and discover elegant communication protocols to do so. To our knowledge, this is the first time deep reinforcement learning has succeeded in learning communication protocols. In addition, we present ablation experiments that confirm that each of the main components of the DDRQN architecture is critical to its success.

Paper Structure

This paper contains 16 sections, 5 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: Hats: Each prisoner can hear the answers from all preceding prisoners (to the left) and see the colour of the hats in front of him (to the right) but must guess his own hat colour.
  • Figure 2: Switch: Every day one prisoner gets sent to the interrogation room where he can see the switch and choose between actions "On", "Off", "Tell" and "None".
  • Figure 3: Hats: Each agent $m$ observes the answers $a^k$, $k < m$, from all preceding agents and the hat colours $s^k$, $k > m$, in front of him. Both variable-length sequences are processed through RNNs. First, the answers heard are passed through two single-layer MLPs, $z^k_a = \text{MLP}(a^k) \oplus \text{MLP}(m,n)$, and their outputs are added element-wise. $z^k_a$ is then passed through an LSTM network, $y^{k}_a, h^{k}_a = \text{LSTM}_a(z^k_a, h^{k-1}_a)$. Similarly, for the observed hats we define $y^{k}_s, h^{k}_s = \text{LSTM}_s(z^k_s, h^{k-1}_s)$. The last values of the two LSTMs, $y^{m-1}_a$ and $y^{n}_s$, are used to approximate $Q^m = \text{MLP}(y^{m-1}_a||y^{n}_s)$, from which the action $a^m$ is chosen.
  • Figure 4: Results on the hats riddle with $n = 10$ agents, comparing DDRQN with and without inter-agent weight sharing to a tabular Q-table and a hand-coded optimal strategy. The lines depict the average of $10$ runs and $95\%$ confidence intervals.
  • Figure 5: Hats: Using curriculum learning, DDRQN achieves good performance for $n = 3 \ldots 20$ agents, compared to the optimal strategy.
  • ...and 6 more figures
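
The captions for Figures 4 and 5 compare DDRQN against a hand-coded optimal strategy without spelling it out. As a point of reference, the sketch below implements the well-known parity solution to the hats riddle, under the setup of Figure 1. It assumes two hat colours encoded as 0 and 1; the function names `play_hats` and `count_correct` are illustrative and do not come from the paper. This is the classical strategy, not the protocol learned by DDRQN.

```python
import random


def play_hats(hats):
    """Play one round with the parity strategy and return all answers.

    hats[m] is the colour (0 or 1) of agent m's hat. Agent m sees
    hats[m+1:] (the hats in front) and hears answers[:m] (the answers
    of all preceding agents), as in Figure 1.
    """
    answers = []
    for m in range(len(hats)):
        seen_parity = sum(hats[m + 1:]) % 2
        if m == 0:
            # The first agent has no information about its own hat, so it
            # uses its answer to announce the parity of the hats it sees.
            answers.append(seen_parity)
        else:
            # answers[0] is the parity of hats[1:]. Subtracting the hats
            # already revealed (answers[1:m]) and the hats still visible
            # (hats[m+1:]) leaves exactly this agent's own hat colour.
            heard_parity = sum(answers[1:m]) % 2
            answers.append((answers[0] - heard_parity - seen_parity) % 2)
    return answers


def count_correct(hats):
    """Number of agents whose answer matches their own hat colour."""
    return sum(a == h for a, h in zip(play_hats(hats), hats))


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for a repeatable demo only
    hats = [rng.randint(0, 1) for _ in range(10)]
    print(count_correct(hats), "of", len(hats), "correct")
```

Under this strategy every agent except the first is guaranteed to answer correctly, and the first is right with probability one half, so the expected number of correct answers for $n$ agents is $n - 1/2$.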