Heterogeneous Value Decomposition Policy Fusion for Multi-Agent Cooperation

Siying Wang, Yang Zhou, Zhitong Zhao, Ruoning Zhang, Jinliang Shao, Wenyu Chen, Yuhua Cheng

TL;DR

Cooperative multi-agent RL often relies on value decomposition (VD) under the Individual-Global-Max (IGM) principle, but existing VD methods trade off representational capacity against training efficiency. The paper introduces Heterogeneous Policy Fusion (HPF), which extends two VD policies into a composite policy set $\Pi=[\boldsymbol{\pi}_{\alpha},\boldsymbol{\pi}_{\beta}]$ and adaptively fuses them via a Boltzmann-based selector, augmented by an instructive KL constraint that aligns the local policies. Empirical results on a matrix game, the StarCraft Multi-Agent Challenge (SMAC), and Predator-Prey show that HPF improves sample efficiency and overall performance over strong baselines, and ablations support the contribution of both value-guided policy sampling and policy extension. HPF offers a practical, easy-to-implement path to robust cooperative behavior by leveraging existing VD methods without designing new factorization schemes.
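
The Boltzmann-based selector can be pictured with a small sketch. It assumes, for illustration only, that each candidate policy in $\Pi$ is scored by a scalar joint-value estimate and that a policy is drawn with probability proportional to $\exp(\text{score}/T)$. The function names, the `temperature` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
import math
import random


def boltzmann_probs(values, temperature=1.0):
    """Return softmax probabilities over a list of scalar value estimates."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_policy(values, temperature=1.0, rng=random):
    """Sample the index of one candidate policy from the Boltzmann distribution."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical joint-value estimates for pi_alpha (index 0) and pi_beta (index 1).
    estimates = [2.0, 1.5]
    print(boltzmann_probs(estimates, temperature=0.5))
    print(select_policy(estimates, temperature=0.5))
```

Under these assumptions, a lower temperature concentrates sampling on the higher-valued policy, while a higher temperature keeps both policies interacting with the environment.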

Abstract

Value decomposition (VD) has become one of the most prominent solutions in cooperative multi-agent reinforcement learning. Most existing methods explore how to factorize the joint value and minimize the discrepancies between agent observations and characteristics of environmental states. However, direct decomposition may result in limited representation or difficulty in optimization. Orthogonal to designing a new factorization scheme, in this paper, we propose Heterogeneous Policy Fusion (HPF) to integrate the strengths of various VD methods. We construct a composite policy set from which policies are adaptively selected for interaction. Specifically, this adaptive mechanism allows agents' trajectories to benefit from diverse policy transitions while incorporating the advantages of each factorization method. Additionally, HPF introduces a constraint between these heterogeneous policies to rectify misleading updates caused by unexpected exploratory or suboptimal non-cooperative behavior. Experimental results on cooperative tasks show HPF's superior performance over multiple baselines, demonstrating its effectiveness and ease of implementation.
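
As a rough illustration of a constraint between heterogeneous policies, the sketch below computes a KL divergence between the action distributions induced by two sets of per-agent utilities. It assumes each local policy is a softmax over its utilities. The direction of the divergence, the temperature, and all function names are hypothetical choices made for this example and are not the paper's formulation.

```python
import math


def softmax(xs, temperature=1.0):
    """Turn a list of utilities into a probability distribution over actions."""
    top = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp((x - top) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # eps guards against log(0) when a probability underflows to zero.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def local_policy_kl(utilities_a, utilities_b, temperature=1.0):
    """KL between the softmax policies induced by two utility vectors of one agent."""
    p = softmax(utilities_a, temperature)
    q = softmax(utilities_b, temperature)
    return kl_divergence(p, q)


if __name__ == "__main__":
    # Hypothetical utilities of one agent under the two VD methods.
    print(local_policy_kl([1.0, 2.0, 0.5], [0.8, 2.5, 0.1]))
```

A term of this kind would be zero when the two local policies agree and grow as they diverge, which is the sense in which such a constraint can pull one policy toward the other.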

Paper Structure

This paper contains 29 sections, 2 theorems, 18 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

If $\Pi=\left[\boldsymbol{\pi}_\alpha, \boldsymbol{\pi}_\beta \right]$ is an extended policy set formed by existing various VD policies, then $\Pi$ still satisfies the IGM criterion.
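
For reference, the IGM criterion named in Proposition 1 is usually written as below. The notation ($Q_{tot}$ for the joint action-value, $Q_i$ for agent $i$'s utility, $\boldsymbol{\tau}$ for the joint action-observation history, $\mathbf{u}$ for the joint action, and $n$ agents) follows the common convention in the VD literature and is assumed here rather than taken from this paper.

$$
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) =
\Big( \arg\max_{u_1} Q_1(\tau_1, u_1), \; \dots, \; \arg\max_{u_n} Q_n(\tau_n, u_n) \Big)
$$

In words, the joint action that maximizes the joint value coincides with the tuple of actions that each agent would pick greedily from its own utility.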

Figures (6)

  • Figure 1: The illustration of the distinction between HPF and traditional VD methods. The traditional scheme directly aligns the optimal joint action and optimizes the central value function with the presupposed VD policy itself. The proposed HPF integrates the benefits of different types of VD policies, and expands them into a policy set to sample the experiences for capturing further performance improvement. These VD policies both participate in the interactions with the environment and learning in an adaptive manner.
  • Figure 2: The architecture of HPF. (a) The VD method with surrogate target. (b) The VD method with network parameters constraint. (c) The instructive constraint between heterogeneous utility functions. The policies of both VD methods constitute a composite policy set and interact with the environment after sampling.
  • Figure 3: Comparison results on the selected scenarios in the StarCraft Multi-Agent Challenge.
  • Figure 4: Comparison results in the Predator-Prey environment.
  • Figure 5: Ablation studies of the random candidate VD policy sampling in HPF.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof