CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration
Yuqian Fu, Yuanheng Zhu, Haoran Li, Zijie Zhao, Jiajun Chai, Dongbin Zhao
TL;DR
CPIG tackles efficient exploration in cooperative MARL under sparse rewards by combining a consistency policy, which enables multimodal action generation, with an intention-guided mechanism that shares a global state estimate among agents. The method integrates a discrete intention codebook learned via an Intention Learner, a masking strategy to balance exploration and exploitation, and a self-reference mechanism to constrain initial policy outputs, all within a centralized training with decentralized execution framework. Empirically, CPIG matches baselines in dense-reward scenarios and surpasses state-of-the-art methods by about 20% in sparse-reward settings across MPE and MAMuJoCo, while also running substantially faster than diffusion-based policies. These results suggest that combining a fast, multimodal consistency policy with shared intention guidance improves cooperative exploration, scalability, and practical applicability in multi-agent systems.
Abstract
Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, because existing methods rely on unimodal policies, they are prone to falling into local optima, which hinders the exploration of better policies. Furthermore, in sparse-reward settings, each agent rarely receives a reward, which poses significant challenges to inter-agent cooperation. This not only increases the difficulty of policy learning but also degrades overall performance on multi-agent tasks. To address these issues, we propose a Consistency Policy with Intention Guidance (CPIG), with two primary components: (a) a multimodal policy that enhances each agent's exploration capability, and (b) intention sharing among agents to foster cooperation. For component (a), CPIG uses a Consistency model as the policy, leveraging its multimodal and stochastic nature to facilitate exploration. For component (b), we introduce an Intention Learner that infers an intention about the global state from each agent's local observation. This intention then guides the Consistency Policy, promoting cooperation among agents. The proposed method is evaluated in multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo). Empirical results demonstrate that our method not only achieves performance comparable to various baselines in dense-reward environments but also significantly improves performance in sparse-reward settings, outperforming state-of-the-art (SOTA) algorithms by 20%.
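To make the idea of a discrete intention codebook (mentioned in the TL;DR) more concrete, the following is a minimal sketch of how an embedding of a local observation could be mapped to a discrete intention. It assumes a nearest-neighbor lookup in the style of vector quantization, which the text above does not confirm as the mechanism used by CPIG. The function name `nearest_intention`, the toy `codebook`, and the `agent_embeddings` values are all hypothetical and chosen only for illustration.

```python
import math

def nearest_intention(embedding, codebook):
    """Return (index, code vector) of the codebook entry closest to `embedding`.

    Assumption: intentions are selected by Euclidean nearest-neighbor lookup.
    This is an illustrative choice, not the confirmed CPIG procedure.
    """
    best_idx = min(
        range(len(codebook)),
        key=lambda k: math.dist(embedding, codebook[k]),
    )
    return best_idx, codebook[best_idx]

# Hypothetical codebook with four 2-D intention vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical per-agent embeddings derived from local observations.
agent_embeddings = [[0.9, 0.1], [0.8, -0.2], [0.1, 0.95]]

# Each agent independently maps its embedding to a discrete intention index.
intentions = [nearest_intention(e, codebook)[0] for e in agent_embeddings]
print(intentions)
```

In this toy setup, agents whose embeddings land near the same code vector receive the same intention index, which is one simple way a shared discrete intention could align the behavior of agents that only see local observations.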
