Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning

Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, Xuguang Lan

TL;DR

IIE tackles exploration in cooperative MARL by using a transformer-based imagination to predict trajectories toward interaction states that influence others' transitions, and by teleporting agents to the imagined state before exploration. It combines a prompt-guided imagination module with an environment simulator to initialize at imagined critical states, forming an auto-curriculum that constrains the exploration space. The imagination model predicts sequences of states, observations, prompts, actions, and rewards in an autoregressive fashion and uses an influence value to bias imagination toward impactful interactions, formalized through $Q_j$ and $\mathcal{I}$. Empirical results on SMAC and SMACv2 show that IIE outperforms baselines, particularly in sparse-reward settings, and yields more effective curricula than generative models like CVAE-GAN or diffusion-based methods.

Abstract

Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.
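As a rough illustration of the sequence-modeling view described above, the following sketch builds per-timestep prompts for one trajectory segment and flattens the segment into a single token stream. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Prompt`, `build_prompts`, and `interleave` are hypothetical, the influence value is taken as a given number rather than computed, the one-shot demonstration is omitted, and the within-step token order simply follows the order in which the abstract lists the modalities.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple


@dataclass
class Prompt:
    """Hypothetical container for the scalar parts of the prompt."""
    timestep_to_go: int   # steps remaining until the segment's final (critical) state
    return_to_go: float   # sum of rewards from this step to the end of the segment
    influence: float      # influence value of the target state (assumed given)


def build_prompts(rewards: Sequence[float], influence: float) -> List[Prompt]:
    """Build one prompt per timestep of a segment that ends at a critical state."""
    horizon = len(rewards)
    rtg = float(sum(rewards))
    prompts = []
    for t, r in enumerate(rewards):
        prompts.append(Prompt(horizon - t, rtg, influence))
        rtg -= r  # the reward at step t is no longer "to go" at step t + 1
    return prompts


def interleave(states, observations, prompts, actions, rewards) -> List[Tuple[str, Any]]:
    """Flatten a segment into one token stream for autoregressive prediction.

    Assumed order per timestep: state, observation, prompt, action, reward.
    """
    tokens: List[Tuple[str, Any]] = []
    for t in range(len(states)):
        tokens += [
            ("state", states[t]),
            ("obs", observations[t]),
            ("prompt", prompts[t]),
            ("action", actions[t]),
            ("reward", rewards[t]),
        ]
    return tokens
```

A sequence model trained on such streams would predict each token from the ones before it. At generation time, setting the prompt tokens to desired values is what steers the imagined trajectory toward a chosen critical state.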

Paper Structure

This paper contains 28 sections, 7 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: An overview of Imagine, Initialize, and Explore. In the pretraining phase, individual agents collect data from the initial state $s_0$ provided by the environment simulator. The interaction sequence is divided into several trajectory segments using influence values, which serve as the training dataset for the imagination model and few-shot demonstrations. Given $s_0$, the prompt generator is trained to produce critical states, and the imagination model learns to predict how to reach such critical states from $s_0$. After pretraining, the imagination model generates a trajectory from the initial state $s_0$ to a critical state $s_\mathcal{T}$ conditioned on $\mathcal{P}(s_0)$ sampled from the prompt generator, and the most related trajectory from the few-shot demonstration dataset. The agents are initialized at $s_\mathcal{T}$ by the environment simulator and then interact with the environment using the $\epsilon$-greedy strategy. We concatenate the imagined and the explored trajectory to train the joint policy in the centralized training phase.
  • Figure 2: Performance comparisons on the dense-reward SMAC and SMACv2 benchmarks.
  • Figure 3: Performance comparisons on the sparse-reward SMAC benchmark.
  • Figure 4: (a-d) Performance comparisons with different returning methods on the SMAC benchmark. (e-g) The mean health of allies and enemies, as well as the relative distance between two groups at the last state in the MMM2 scenario. (h) The 2D t-SNE embeddings of the trajectory returned from IIE in the MMM2 scenario after pretraining.
  • Figure 5: IIE with different prompts on the SMAC benchmark.
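The post-pretraining loop described in the Figure 1 caption (imagine a trajectory from $s_0$ to a critical state, initialize the simulator there, explore with $\epsilon$-greedy, then concatenate the two trajectories) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's code: `ToySimulator` is a hypothetical 1-D chain standing in for a simulator with a state-reset capability, `imagine` is a hard-coded stand-in for the transformer imagination model, and the joint action of all agents is collapsed into a single integer.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float]  # (state, action, reward)


class ToySimulator:
    """Hypothetical 1-D chain; reward 1.0 only when the goal state is reached."""

    def __init__(self, goal: int = 10):
        self.goal = goal
        self.state = 0

    def reset_to(self, state: int) -> int:
        # The "initialize" step: place the environment directly at a chosen state.
        self.state = state
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Action 1 moves right, action 0 moves left; states are clipped to [0, goal].
        self.state = max(0, min(self.goal, self.state + (1 if action == 1 else -1)))
        done = self.state == self.goal
        return self.state, (1.0 if done else 0.0), done


def imagine(s0: int, target: int) -> List[Transition]:
    """Stand-in for the imagination model: a straight path from s0 to target."""
    return [(s, 1, 0.0) for s in range(s0, target)]


def explore(sim: ToySimulator, start: int, q_values: Callable[[int, int], float],
            epsilon: float, max_steps: int, rng: random.Random) -> List[Transition]:
    """Epsilon-greedy rollout that starts from the imagined critical state."""
    state = sim.reset_to(start)
    trajectory: List[Transition] = []
    for _ in range(max_steps):
        if rng.random() < epsilon:
            action = rng.choice([0, 1])                            # explore
        else:
            action = max((0, 1), key=lambda a: q_values(state, a))  # exploit
        next_state, reward, done = sim.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def iie_episode(sim: ToySimulator, s0: int, target: int,
                q_values: Callable[[int, int], float], epsilon: float,
                max_steps: int, rng: random.Random) -> List[Transition]:
    """Concatenate the imagined prefix with the explored suffix for training."""
    imagined = imagine(s0, target)
    explored = explore(sim, target, q_values, epsilon, max_steps, rng)
    return imagined + explored
```

For example, `iie_episode(ToySimulator(goal=10), 0, 7, lambda s, a: float(a), 0.1, 20, random.Random(0))` returns a single trajectory whose first seven transitions come from `imagine` and whose remainder comes from real simulator steps. Starting exploration at state 7 rather than 0 is the toy analogue of constraining the exploration space by initializing at a critical state.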