CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration

Yuqian Fu, Yuanheng Zhu, Haoran Li, Zijie Zhao, Jiajun Chai, Dongbin Zhao

TL;DR

CPIG tackles efficient exploration in cooperative MARL under sparse rewards by combining a consistency policy, which enables multimodal action generation, with an intention-guided mechanism that shares a global state estimate among agents. The method integrates a discrete intention codebook learned by an Intention Learner, a masking strategy to balance exploration and exploitation, and a self-reference mechanism to constrain initial policy outputs, all within a centralized-training, decentralized-execution framework. Empirically, CPIG matches baselines in dense-reward scenarios and surpasses state-of-the-art methods by about 20% in sparse-reward settings across MPE and MAMuJoCo, while also running substantially faster than diffusion-based policies. These results suggest that combining a fast, multimodal consistency policy with shared intention guidance improves cooperative exploration, scalability, and practical applicability in multi-agent systems.

Abstract

Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, because existing methods rely on unimodal policies, they are prone to falling into local optima, which hinders the exploration of better policies. Furthermore, in sparse-reward settings each agent rarely receives a reward, which poses significant challenges to inter-agent cooperation. This not only makes policy learning harder but also degrades overall performance on multi-agent tasks. To address these issues, we propose a Consistency Policy with Intention Guidance (CPIG), with two primary components: (a) a multimodal policy that enhances each agent's exploration capability, and (b) intention sharing among agents to foster cooperation. For component (a), CPIG uses a consistency model as the policy, leveraging its multimodal and stochastic nature to facilitate exploration. For component (b), we introduce an Intention Learner that infers an intention about the global state from each agent's local observation. This intention then guides the Consistency Policy, promoting cooperation among agents. The proposed method is evaluated in multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo). Empirical results demonstrate that our method not only achieves performance comparable to various baselines in dense-reward environments but also significantly improves performance in sparse-reward settings, outperforming state-of-the-art (SOTA) algorithms by 20%.
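
To make component (a) concrete, the following is a minimal sketch of how a consistency model can map Gaussian noise to an action in a single step, conditioned on an observation and an intention vector. It is an illustration under stated assumptions, not the authors' implementation: the skip/output parameterization and the constants `SIGMA_DATA`, `EPS`, and `T_MAX` follow the general consistency-model formulation, and `ToyConsistencyPolicy` with its random linear layer is a hypothetical stand-in for a trained network.

```python
import numpy as np

# Assumed constants from the general consistency-model formulation,
# not values reported for CPIG.
SIGMA_DATA = 0.5   # assumed scale of the action data
EPS = 0.002        # smallest noise level
T_MAX = 80.0       # largest noise level


def c_skip(t):
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


class ToyConsistencyPolicy:
    """Hypothetical policy: a random linear layer stands in for a trained net."""

    def __init__(self, obs_dim, intent_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = act_dim + obs_dim + intent_dim + 1  # +1 for the noise level
        self.W = rng.normal(scale=0.1, size=(in_dim, act_dim))
        self.act_dim = act_dim

    def denoise(self, x, t, obs, intention):
        """Consistency function f(x, t | obs, intention)."""
        feats = np.concatenate([x, obs, intention, [np.log(t)]])
        raw = np.tanh(feats @ self.W)
        # The skip connection enforces the boundary condition f(x, EPS) = x.
        return c_skip(t) * x + c_out(t) * raw

    def act(self, obs, intention, rng):
        """One-step sampling: draw noise at level T_MAX and map it to an action."""
        x_T = rng.normal(size=self.act_dim) * T_MAX
        return np.clip(self.denoise(x_T, T_MAX, obs, intention), -1.0, 1.0)
```

Because each call to `act` starts from a fresh noise sample, the same observation and intention can yield different actions, which is the stochastic, multimodal behaviour the abstract relies on for exploration. A single network evaluation per action is also what distinguishes this from iterative diffusion sampling.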

Paper Structure

This paper contains 25 sections, 12 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: An example of cooperative exploration. The two-agent arm requires collaborative exploration to reach four targets at different locations. In a sparse reward setting, agents should reach all targets before receiving any reward, making exploration and cooperation more challenging. The top-right corner illustrates a visualization of the multimodal joint policy.
  • Figure 2: The overall framework of the proposed CPIG. Intention-guided diffusion policies facilitate cooperative exploration among multiple agents. The consistency policy takes the agent's observation and Gaussian noise as inputs and generates actions under the guidance of the masked intention. During training, in addition to policy learning, the self-reference mechanism back-propagates gradients based on the disparity between the agent's actions and those from the reference buffer, thereby imposing a policy constraint.
  • Figure 3: The framework of the Intention Learner. The Intention Learner consists of an observation encoder, an intention codebook, and a state decoder, and learns discrete intention representations by reconstructing states. It operates differently during training and execution.
  • Figure 4: Demonstrations of five sparse-reward environments. Both HalfCheetah (2x3) and HalfCheetah (6x1) are represented as HalfCheetah.
  • Figure 5: Example results obtained by CPIG and baselines on exploration visitation. (a) shows a Reacher4 task with four targets in different colors. (b)-(d) show the state visitation of CPIG, HASAC, and MAPPO. (e) counts the targets covered by the policies during exploration.
  • ...and 3 more figures
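
The captions for Figures 2 and 3 describe an intention codebook and a masked intention without giving the exact rules. The sketch below shows one plausible reading, offered only as an assumption: the encoder output is snapped to its nearest codebook entry (the standard vector-quantization lookup), the state decoder is scored with a mean-squared reconstruction error, and masking zeroes the whole intention with a fixed probability. The function names `quantize`, `mask_intention`, and `reconstruction_loss` and the parameter `mask_prob` are hypothetical.

```python
import numpy as np


def quantize(z, codebook):
    """Return (index, code) of the codebook row nearest to encoder output z.

    Nearest-neighbour lookup is assumed here; the source only states that
    the intention representation is discrete.
    """
    dists = np.sum((codebook - z) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]


def mask_intention(code, mask_prob, rng):
    """Assumed masking rule: drop the whole intention with probability mask_prob."""
    if rng.random() < mask_prob:
        return np.zeros_like(code)
    return code


def reconstruction_loss(state, decoded_state):
    """Mean-squared error between the global state and its reconstruction."""
    return float(np.mean((state - decoded_state) ** 2))


# Usage with a toy 3-entry codebook of 2-dimensional intentions.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
idx, code = quantize(np.array([0.9, 1.2]), codebook)  # nearest entry is row 1
guidance = mask_intention(code, mask_prob=0.3, rng=np.random.default_rng(0))
```

Under this reading, a masked (all-zero) intention leaves the policy to act on its own observation and noise, while an unmasked intention steers it toward the shared global-state estimate; how CPIG actually schedules or applies the mask is not stated in the text above.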