Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

Zhiyuan Li, Wenshuai Zhao, Joni Pajarinen

TL;DR

The paper tackles sample efficiency, interpretability, and transferability in cooperative multi-agent systems by introducing COMPASS, a decentralized framework that combines a Vision-Language Model (VLM) based closed-loop planner, an adaptive, demonstration-bootstrapped skill library, and a structured, multi-hop communication protocol. On SMACv2 it performs strongly, particularly on Protoss tasks: in the symmetric Protoss 5v5 task it achieves a win rate of $0.57$ and surpasses baselines such as QMIX, MAPPO, HAPPO, and HASAC. Extensive ablations isolate the contributions of skill bootstrapping, communication, and self-reflection. The approach emphasizes interpretable, code-based skills and dynamic strategy refinement in a partially observable, decentralized setting, offering a scalable path toward real-world multi-agent coordination. However, performance gaps in certain race settings (notably Zerg) show that generalization and efficiency across diverse unit compositions and tactics still need improvement.

Abstract

Despite much progress in training distributed artificial intelligence (AI), building cooperative multi-agent systems with multi-agent reinforcement learning (MARL) faces challenges in sample efficiency, interpretability, and transferability. Unlike traditional learning-based methods that require extensive interaction with the environment, large language models (LLMs) demonstrate remarkable capabilities in zero-shot planning and complex reasoning. However, existing LLM-based approaches heavily rely on text-based observations and struggle with the non-Markovian nature of multi-agent interactions under partial observability. We present COMPASS, a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. COMPASS propagates entity information through multi-hop communication under partial observability. Evaluations on the improved StarCraft Multi-Agent Challenge (SMACv2) demonstrate COMPASS's strong performance against state-of-the-art MARL baselines across both symmetric and asymmetric scenarios. Notably, in the symmetric Protoss 5v5 task, COMPASS achieved a 57% win rate, representing a 30 percentage point advantage over QMIX (27%). The project page can be found at https://stellar-entremet-1720bb.netlify.app/.

Paper Structure

This paper contains 21 sections, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of the COMPASS architecture, a novel framework that advances cooperative multi-agent decision-making through three synergistic components: (1) A VLM-based closed-loop planner that enables decentralized control by continuously processing multi-modal feedback and adapting strategies, addressing the non-Markovian challenge of multi-agent systems; (2) A dynamic skill synthesis mechanism that combines demonstration bootstrapping with incremental skill generation, improving sample efficiency and interpretability; and (3) A structured communication protocol that facilitates efficient information sharing through entity-based multi-hop propagation, enhancing cooperative perception under partial observability.
  • Figure 2: Visualization of COMPASS's dynamic task reasoning process in the StarCraft Multi-Agent Challenge (SMACv2) environment. The figure demonstrates how the VLM-based planner decomposes a complex final goal ("defeat all enemy units") into a sequence of concrete, executable sub-tasks that adapt to the changing battlefield conditions. This closed-loop task decomposition enables efficient coordination among multiple agents under partial observability, as each sub-task provides clear, actionable objectives that agents can execute while maintaining overall mission alignment.
  • Figure 3: Illustration of self-reflection. Following skill execution (e.g., 'stalker_cover') and feedback, COMPASS assesses performance. This analysis guides further skill generation to refine tactics (e.g., registering an 'Enhanced Stalker Script Type Aggressive').
  • Figure 4: Overview of Adaptive Skill Synthesis. VLMs perform (Top) Bootstrapping by analyzing offline data for initial Tactic Analysis and Skill Generation into a Skill Library. (Bottom) Incremental Synthesis uses Task Reasoning to dynamically generate or enhance code-based skills, evolving the library for new tasks. The skills follow a structured decision-making pipeline with two core components: score_target(unit) for dynamic target prioritization and control_logic() for coordinating behavior. Textual observations are parsed into structured data (obs_data), mapping raw text to attributes, e.g., "Can move North: yes" -> can_move='north': True.
  • Figure 5: Illustration of COMPASS's structured multi-hop communication protocol that enables efficient information sharing under partial observability. The figure demonstrates how information about Enemy #1 propagates to the Ego agent through a chain of allied units (Ally #1, #2, #3), despite Enemy #1 being outside Ego's sight range. Each dashed circle represents an agent's local observation field, while arrows indicate the flow of entity-based information sharing. This mechanism enables agents to build a more holistic understanding of the environment by propagating critical information (e.g., enemy positions, status) through intermediate allies, effectively addressing the partial observability challenge in decentralized multi-agent systems.
  • ...and 5 more figures
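
The Figure 4 caption describes code-based skills built around score_target(unit) and control_logic(), fed by textual observations parsed into obs_data. The following is a minimal sketch of that shape, not the authors' implementation. The nested layout of obs_data, the enemy fields (id, health, distance), the scoring formula, and the returned action tuples are all assumptions made for illustration; only the two function names and the "Can move North: yes" example come from the caption.

```python
import re


def parse_observation(text):
    """Parse 'Can move <Direction>: yes/no' lines into a nested dict.

    The layout {"can_move": {"north": True}} is an assumed reading of the
    caption's example, not a documented format.
    """
    obs_data = {"can_move": {}}
    pattern = r"\s*Can move (\w+):\s*(yes|no)\s*"
    for line in text.splitlines():
        match = re.fullmatch(pattern, line, flags=re.IGNORECASE)
        if match:
            direction, answer = match.groups()
            obs_data["can_move"][direction.lower()] = answer.lower() == "yes"
    return obs_data


def score_target(unit):
    """Hypothetical priority: prefer low-health, nearby enemies."""
    return (1.0 - unit["health"]) + 1.0 / (1.0 + unit["distance"])


def control_logic(obs_data, enemies):
    """Attack the best-scored enemy, else move where allowed, else stop."""
    if enemies:
        target = max(enemies, key=score_target)
        return ("attack", target["id"])
    for direction, allowed in obs_data["can_move"].items():
        if allowed:
            return ("move", direction)
    return ("stop", None)


obs = parse_observation("Can move North: yes\nCan move South: no")
enemies = [
    {"id": 1, "health": 0.9, "distance": 2.0},
    {"id": 2, "health": 0.3, "distance": 4.0},
]
print(obs)                          # {'can_move': {'north': True, 'south': False}}
print(control_logic(obs, enemies))  # ('attack', 2)
print(control_logic(obs, []))       # ('move', 'north')
```

Keeping target scoring separate from the control flow means a synthesized skill can change its prioritization without touching the movement or fallback logic, which is one plausible reason for the two-component pipeline the caption describes.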
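
The multi-hop propagation in the Figure 5 caption, where information about Enemy #1 reaches the Ego agent through a chain of allies, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the paper's protocol: agents are points in 2D, allies exchange entity information only when they are within each other's sight range, each round of exchange counts as one hop, and the names, positions, and the propagate function are hypothetical.

```python
import math


def within(a, b, radius):
    """True if points a and b are at most `radius` apart."""
    return math.dist(a, b) <= radius


def propagate(allies, enemies, sight_range, max_hops):
    """Return, per ally, the enemy positions it knows after `max_hops` rounds.

    allies and enemies map a name to an (x, y) position.
    """
    # Hop 0: each ally knows only the enemies inside its own sight range.
    known = {
        name: {e: pos for e, pos in enemies.items() if within(p, pos, sight_range)}
        for name, p in allies.items()
    }
    for _ in range(max_hops):
        # Synchronous update: every ally merges what its neighbours knew
        # at the start of the round, so information moves one hop per round.
        updated = {name: dict(info) for name, info in known.items()}
        for a, pa in allies.items():
            for b, pb in allies.items():
                if a != b and within(pa, pb, sight_range):
                    updated[a].update(known[b])
        known = updated
    return known


# A chain of allies between the ego agent and an enemy it cannot see.
allies = {"ego": (0, 0), "ally_3": (2, 0), "ally_2": (4, 0), "ally_1": (6, 0)}
enemies = {"enemy_1": (8, 0)}

print(propagate(allies, enemies, sight_range=2.5, max_hops=2)["ego"])  # {}
print(propagate(allies, enemies, sight_range=2.5, max_hops=3)["ego"])  # {'enemy_1': (8, 0)}
```

In this layout only ally_1 observes enemy_1 directly, so the ego agent needs three relay rounds before the enemy's position reaches it, which matches the three-ally chain the caption describes.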