From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos

Xun Wang, Zhuoran Li, Yanshan Lin, Hai Zhong, Longbo Huang

TL;DR

This work addresses the inefficiency of training multi-agent policies from scratch by proposing SoCo, a framework that leverages abundant solo demonstrations to accelerate cooperative MARL. SoCo pretrains a shared solo policy from solo data, decomposes cooperative observations into solo views, and fuses solo actions with a gating selector and an action editor to adapt to multi-agent coordination. Through extensive experiments on nine tasks with backbone methods MATD3 and HATD3, SoCo demonstrates substantial improvements in sample efficiency and final performance, effectively mitigating observation mismatch and domain shift between solo and cooperative settings. The approach offers a scalable, plug-and-play strategy to broaden the practicality and applicability of cooperative MARL in real-world scenarios.

Abstract

Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
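The policy fusion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the array sizes, the equal-split observation decomposition, the linear stand-ins for the networks, the soft (softmax) gate, and the bounded residual form of the editor are all assumptions made for illustration, and every name below (`solo_policy`, `decompose`, `fuse`, `edit_scale`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: each solo view has OBS_DIM features, actions have ACT_DIM
# dimensions, and a cooperative observation decomposes into N_VIEWS solo views.
OBS_DIM, ACT_DIM, N_VIEWS = 4, 2, 3

# Stand-in for the frozen, pretrained solo policy (assumption: a fixed linear
# map followed by tanh; it is never updated in this sketch).
W_SOLO = rng.normal(size=(OBS_DIM, ACT_DIM))


def solo_policy(solo_obs):
    """Map solo observations of shape (..., OBS_DIM) to actions in [-1, 1]."""
    return np.tanh(solo_obs @ W_SOLO)


def decompose(coop_obs):
    """Assumed decomposition: split a flat cooperative observation into
    N_VIEWS equal-sized solo views."""
    return coop_obs.reshape(N_VIEWS, OBS_DIM)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def fuse(coop_obs, w_gate, w_edit, edit_scale=0.1):
    """Combine solo actions with an MoE-like gate, then apply a small edit."""
    views = decompose(coop_obs)                  # (N_VIEWS, OBS_DIM)
    candidates = solo_policy(views)              # (N_VIEWS, ACT_DIM)
    gate = softmax(coop_obs @ w_gate)            # (N_VIEWS,), sums to 1
    selected = gate @ candidates                 # (ACT_DIM,) weighted solo action
    # Assumed editor form: a bounded residual correction to the selected action.
    edit = edit_scale * np.tanh(coop_obs @ w_edit)
    return np.clip(selected + edit, -1.0, 1.0), gate


# Randomly initialised gate and editor weights; in training these would be the
# only learned parts, while the solo policy stays frozen.
coop_obs = rng.normal(size=N_VIEWS * OBS_DIM)
w_gate = rng.normal(size=(N_VIEWS * OBS_DIM, N_VIEWS))
w_edit = rng.normal(size=(N_VIEWS * OBS_DIM, ACT_DIM))

action, gate = fuse(coop_obs, w_gate, w_edit)
print("gate weights:", gate)
print("fused action:", action)
```

In this sketch only `w_gate` and `w_edit` would be trained by the backbone MARL algorithm, which mirrors the idea of reusing solo knowledge unchanged and learning just the coordination-specific parts. A hard top-1 gate or a different editor parameterisation would fit the same description equally well.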

Paper Structure

This paper contains 46 sections, 19 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: SoCo framework. A shared solo policy is pretrained from demonstrations and kept frozen, then reused through observation decomposition during cooperative training. Coordination ability is injected by the Policy Fusion module, where the Gating Selector selects suitable solo actions and the Action Editor fine-tunes them to mitigate domain shift.
  • Figure 2: Training curves on nine tasks. Results are averaged over three random seeds, with solid and dashed lines indicating the mean performance and shaded areas representing one standard deviation.
  • Figure 3: Ablation study of SoCo. Results are averaged over three random seeds, with solid and dashed lines indicating the mean performance and shaded areas representing one standard deviation.
  • Figure 4: All the cooperative tasks in our experiments.
  • Figure 5: Solo tasks corresponding to each cooperative scenario.