Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang, Zixuan Huang, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

TL;DR

The paper proposes HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. It introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness.

Abstract

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
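
To make the idea of rollout sharing concrete, the following is a minimal sketch, not the HACPO algorithm itself. It assumes two agents answer the same prompt, their verified rewards are pooled, and a plain group-mean baseline with standard-deviation scaling (GRPO-style) is applied. The `Rollout` class, the agent names, and `pooled_advantages` are hypothetical, and the four HACPO mechanisms mentioned in the abstract are deliberately left out.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    agent: str      # which agent produced the response (hypothetical label)
    response: str   # generated text
    reward: float   # verified reward, e.g. 1.0 if the answer checks out


def pooled_advantages(rollouts, eps=1e-6):
    """Group-relative advantages over all rollouts pooled for one prompt.

    Assumption: a plain mean baseline and std scaling, with no correction
    for capability gaps or policy distribution shift between agents.
    """
    rewards = [r.reward for r in rollouts]
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r.reward - baseline) / scale for r in rollouts]


# Hypothetical rollouts for a single prompt, two responses per agent.
shared = [
    Rollout("agent_a", "answer 1", 1.0),
    Rollout("agent_a", "answer 2", 0.0),
    Rollout("agent_b", "answer 3", 1.0),
    Rollout("agent_b", "answer 4", 1.0),
]
advantages = pooled_advantages(shared)
```

In this sketch every agent would train on all four rollouts, so each agent only needs to generate part of the group itself. Whether naive pooling of this kind is statistically sound when the agents differ in capability is exactly the concern the tailored mechanisms are described as addressing.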

Paper Structure (34 sections, 8 theorems, 54 equations, 4 figures, 7 tables, 1 algorithm)

Key Result

Theorem 4.1

Let agent $k$ generate $G$ responses $\{y^{(k)}_{t,i}\}_{i=1}^G \sim \pi_{\theta_k}(\cdot \mid q_t)$ at training step $t$, and let $\mu_t^{(k)}$ denote the mixed-response advantage baseline used by HACPO for agent $k$. Then, under the shared reward function $R(\cdot)$, the baseline is unbiased in expectation, where the expectation is taken over the stochasticity of response generation.
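
The precise form of $\mu_t^{(k)}$ is not given here, so the sketch below does not reproduce it. It only illustrates why unbiasedness is a non-trivial property for a mixed-response baseline, under two assumptions of ours: rewards are binary, and "unbiased" is measured against agent $k$'s own expected reward. The function name and the success rates are hypothetical.

```python
from fractions import Fraction


def expected_pooled_baseline(p_self, p_other, g_self, g_other):
    """Expected value of a naive pooled mean-reward baseline.

    Assumes binary rewards: agent k succeeds with probability p_self on
    each of its g_self responses, and the other agent succeeds with
    probability p_other on each of its g_other shared responses.
    """
    return (g_self * p_self + g_other * p_other) / (g_self + g_other)


# Hypothetical success rates for a weaker agent k and a stronger agent j.
p_k = Fraction(3, 10)
p_j = Fraction(7, 10)

naive = expected_pooled_baseline(p_k, p_j, g_self=4, g_other=4)
bias = naive - p_k  # gap between the pooled baseline and agent k's own mean
```

With these numbers the naive pooled baseline has expectation 1/2, which overshoots agent $k$'s expected reward of 3/10 by 1/5. The gap vanishes only when both agents have the same success rate, which is why a mixed baseline needs some correction for capability differences before it can be unbiased in this sense.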

Figures (4)

  • Figure 1: Key differences among Multi-Agent RL, Knowledge Distillation, and the proposed HACRL. HACRL targets independent execution with collaborative optimization.
  • Figure 2: In HACPO, shared rollouts from multiple heterogeneous agents are leveraged for collaborative training. Built on vanilla RL optimization, HACPO introduces four algorithmic innovations to mitigate discrepancies in capability and policy distribution.
  • Figure 3: Training curves of GSPO and HACPO
  • Figure 4: Ablation of stepwise clipping

Theorems & Definitions (24)

  • Definition 2.1: Heterogeneous State
  • Definition 2.2: Heterogeneous Size
  • Definition 2.3: Heterogeneous Model
  • Remark 2.4
  • Definition 2.5: HACRL Problem
  • Remark 3.1
  • Theorem 4.1: Unbiased Advantage Estimator
  • Corollary 4.2: Unbiased Advantage
  • Theorem 4.3: Gradient Alignment and Effectiveness of HACPO
  • Remark 4.2
  • ...and 14 more