Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Hyungho Na, Yunkyeong Seo, Il-chul Moon

TL;DR

Efficient episodic Memory Utilization (EMU) for MARL introduces a novel reward structure called episodic incentive based on the desirability of states, which improves the TD target in Q-learning and acts as an additional incentive for desirable transitions.

Abstract

In cooperative multi-agent reinforcement learning (MARL), agents aim to achieve a common goal, such as defeating enemies or scoring a goal. Existing MARL algorithms are effective but still require significant learning time and often become trapped in local optima on complex tasks, subsequently failing to discover a goal-reaching policy. To address this, we introduce Efficient episodic Memory Utilization (EMU) for MARL, with two primary objectives: (a) accelerating reinforcement learning by leveraging semantically coherent memory from an episodic buffer and (b) selectively promoting desirable transitions to prevent convergence to local optima. To achieve (a), EMU incorporates a trainable encoder/decoder structure alongside MARL, creating coherent memory embeddings that facilitate exploratory memory recall. To achieve (b), EMU introduces a novel reward structure, called episodic incentive, based on the desirability of states. This reward improves the TD target in Q-learning and acts as an additional incentive for desirable transitions. We provide theoretical support for the proposed incentive and demonstrate the effectiveness of EMU compared to conventional episodic control. The proposed method is evaluated in StarCraft II and Google Research Football, and empirical results indicate further performance improvement over state-of-the-art methods.
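To make the idea of recalling returns from an episodic buffer concrete, the following is a minimal sketch of a generic episodic memory keyed by state embeddings. It is not EMU's implementation: the class name `EpisodicBuffer`, the recall radius `delta`, and the Euclidean nearest-neighbor lookup are all assumptions made for illustration, and the embeddings here are plain tuples rather than outputs of a trained encoder.

```python
import math


class EpisodicBuffer:
    """Toy episodic memory: embedding -> highest return observed from it."""

    def __init__(self, delta):
        self.delta = delta   # hypothetical recall radius in embedding space
        self.keys = []       # stored embeddings x
        self.values = []     # best return H(x) recorded for each embedding

    def _nearest(self, x):
        """Return (index, distance) of the closest stored key, or (None, inf)."""
        best_i, best_d = None, math.inf
        for i, key in enumerate(self.keys):
            d = math.dist(key, x)
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d

    def update(self, x, episode_return):
        """Keep the maximum return for embeddings that fall within delta."""
        i, d = self._nearest(x)
        if i is not None and d <= self.delta:
            self.values[i] = max(self.values[i], episode_return)
        else:
            self.keys.append(tuple(x))
            self.values.append(episode_return)

    def recall(self, x):
        """Return the stored value of the nearest key, or None if none is close."""
        i, d = self._nearest(x)
        return self.values[i] if i is not None and d <= self.delta else None


buffer = EpisodicBuffer(delta=0.5)
buffer.update((0.0, 0.0), 1.0)
buffer.update((0.1, 0.0), 3.0)  # within delta of the first key: keeps the max
buffer.update((5.0, 5.0), 2.0)  # far from existing keys: stored as a new entry
```

In a sketch like this, the quality of recall depends entirely on how well distances in the embedding space reflect similarity between states, which is the motivation the abstract gives for training the embeddings rather than fixing them.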

Paper Structure

This paper contains 46 sections, 2 theorems, 22 equations, 34 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Given a transition $(s,\bm{a},r,s')$ and $H(x')$, let $L_{\theta}$ be the Q-learning loss with additional transition reward, i.e., $L_{\theta}:= {(y(s,\bm{a})+{r^{EC}}(s,\bm{a},s') - {Q_{tot}}(s,\bm{a};\theta))^2}$ where ${r^{EC}}(s,\bm{a},s') := \lambda (r(s,\bm{a}) + \gamma H(x') - {Q_\theta }(s,\
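The loss in the theorem statement has a simple shape: an additional transition reward is added to the target $y(s,\bm{a})$ before the squared error against $Q_{tot}$ is taken. A minimal sketch of that shape follows. It assumes that $y(s,\bm{a})$ is a one-step TD target, which the text above does not spell out, and it takes the additional reward as an input rather than computing it, because its definition is cut off in the statement. The function and parameter names are hypothetical.

```python
def augmented_td_loss(q_tot, reward, next_q_max, extra_reward,
                      gamma=0.99, done=False):
    """Squared TD error with an additional transition reward in the target.

    q_tot        -- current joint estimate Q_tot(s, a; theta)
    reward       -- environment reward r(s, a)
    next_q_max   -- max over next joint actions of the target Q at s'
    extra_reward -- additional transition reward (supplied by the caller)
    """
    # Assumed one-step TD target y(s, a); no bootstrapping at episode end.
    y = reward + (0.0 if done else gamma * next_q_max)
    td_error = y + extra_reward - q_tot
    return td_error ** 2


plain = augmented_td_loss(q_tot=1.0, reward=0.5, next_q_max=2.0,
                          extra_reward=0.0, gamma=0.5)
boosted = augmented_td_loss(q_tot=1.0, reward=0.5, next_q_max=2.0,
                            extra_reward=0.25, gamma=0.5)
```

With these toy numbers the extra reward raises the effective target from 1.5 to 1.75, so the same transition produces a larger error signal and pulls the estimate further upward.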

Figures (34)

  • Figure 1: Overview of EMU framework.
  • Figure 2: t-SNE of sampled embeddings $x \in {\mathcal{D}}_{E}$. Colors from red to purple (rainbow order) represent low to high return.
  • Figure 3: Episodic incentive. Test trajectories are plotted on the embedded space with sampled memories in ${\mathcal{D}}_{E}$, denoted with dotted markers. Star markers and numbers represent the desirability of state and timestep in the episode, respectively. Color represents the same semantics as Figure 2.
  • Figure 4: Performance comparison of EMU against baseline algorithms on three easy and hard SMAC maps: 1c3s5z, 3s_vs_5z, and 5m_vs_6m, and three super hard SMAC maps: MMM2, 6h_vs_8z, and 3s5z_vs_3s6z.
  • Figure 5: Performance comparison of EMU against baseline algorithms on Google Research Football.
  • ...and 29 more figures

Theorems & Definitions (5)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • proof
  • proof