Harnessing Language for Coordination: A Framework and Benchmark for LLM-Driven Multi-Agent Control

Timothée Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, Sebastian Risi

TL;DR

This work addresses how large language models can help coordinate thousands of agents in real time by introducing HIVE, a hybrid system that converts human strategic input into LLM-generated plans and then assigns behavior trees (BTs) to units for execution in a real-time strategy (RTS) benchmark. The framework combines a two-phase interaction (dialogue-driven planning followed by plan-based execution) with a structured plan format and a BT-based control layer, enabling scalable multi-agent coordination. Through a dedicated benchmark with five ability tests and evaluations of nine LLMs, the authors show that generalist LLMs can achieve complex coordination when aided by human input, but face limitations in spatial reasoning, long-horizon planning, and input sensitivity; textual map descriptions tend to outperform image-based inputs in current settings. The findings highlight the potential of hybrid human-LLM collaboration for multi-agent coordination while identifying practical hurdles and avenues for improvement, including multimodal map understanding and scalable, real-time inference.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. Their potential to facilitate human coordination with many agents is a promising but largely under-explored area. Such capabilities would be helpful in disaster response, urban planning, and real-time strategy scenarios. In this work, we introduce (1) a real-time strategy game benchmark designed to evaluate these abilities and (2) a novel framework we term HIVE. HIVE empowers a single human to coordinate swarms of up to 2,000 agents through a natural language dialog with an LLM. We present promising results on this multi-agent benchmark, with our hybrid approach solving tasks such as coordinating agent movements, exploiting unit weaknesses, leveraging human annotations, and understanding terrain and strategic points. Our findings also highlight critical limitations of current models, including difficulties in processing spatial visual information and challenges in formulating long-term strategic plans. This work sheds light on the potential and limitations of LLMs in human-swarm coordination, paving the way for future research in this area. The HIVE project page, hive.syrkis.com, includes videos of the system in action.

Paper Structure

This paper contains 57 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Example of interactions between the player and HIVE. To win this scenario, the player and HIVE have to devise a plan to prevent the enemy units (in red) from reaching the center of the camp. To do so, the player proposes a high-level description, which HIVE takes as input to write down an actual plan using a structured output. In this example, HIVE also briefly describes the situation and the plan. Notice that even though the situational description is not always correct (ten bridges instead of nine), the actual plan is correct and wins this scenario. In this example, HIVE uses claude-3-5-sonnet-20241022 as the underlying LLM. Snapshots from the corresponding game are shown at the bottom.
  • Figure 2: Overview of the HIVE approach. HIVE enables players to command up to several thousand units by delegating the tedious task of assigning behaviors and objectives to each unit to a general-purpose LLM. HIVE operates in two main phases: (1) a discussion phase with an LLM to develop a plan before the game starts and (2) the execution phase, where the plan assigns a behavior tree to each unit. We implemented the modules that handle the units in JAX [jax2018github] and those that manage high-level information in Python. See the main text for a more detailed description.
  • Figure 3: The HIVE Benchmark Ability Tests. (a) Coordinate where the player has to eliminate all the enemies using 1,000 units, (b) Exploit weakness where the player has to efficiently use the three types of units to eliminate the enemies, (c) Follow markers where the player has to bring at least one unit to the south, (d) Exploit terrain where the player has to bring at least one unit to the opposite corner of the map, and (e) Strategize points where the player has to prevent the enemies from reaching the center of their camp.
  • Figure 4: Successful plans by HIVE using different LLMs for each ability test. HIVE can translate high-level commands into successful plans, from coordinating thousands of units to exploiting the terrain, enemy weaknesses, or strategic points. Blue units are the player's units. Red units are the enemies.
  • Figure 5: Ability evaluations of HIVE using nine LLMs with the same ten prompts for each ability test. Apart from Follow Markers, the different ability tests are solvable but challenging for all LLMs. Sonnet performs best, while a small LLM like Llama3-8B completely fails.
  • ...and 9 more figures
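
The execution phase described in the Figure 2 caption, where a plan assigns a behavior tree to each unit, can be illustrated with a minimal Python sketch. This is not the authors' implementation (the caption notes that the unit-handling modules are written in JAX), and every name below (`Unit`, `leaf`, `fallback`, `BEHAVIORS`, `assign_behaviors`, and the plan fields `unit_kind`, `behavior`, `target`) is a hypothetical stand-in chosen for illustration. The plan format used here is an assumption, not HIVE's actual structured output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical unit state; the fields are illustrative, not HIVE's.
@dataclass
class Unit:
    uid: int
    kind: str
    enemy_in_range: bool = False

# A tree node maps a unit to an action string, or None on failure.
Node = Callable[[Unit], Optional[str]]

def leaf(condition: Callable[[Unit], bool], action: str) -> Node:
    """Return the action if the condition holds, otherwise fail."""
    return lambda unit: action if condition(unit) else None

def fallback(*children: Node) -> Node:
    """Try children in order and return the first action that succeeds."""
    def tick(unit: Unit) -> Optional[str]:
        for child in children:
            result = child(unit)
            if result is not None:
                return result
        return None
    return tick

def attack_or_move(target: str) -> Node:
    # Attack when an enemy is in range, otherwise move toward the target.
    return fallback(
        leaf(lambda u: u.enemy_in_range, "attack"),
        leaf(lambda u: True, f"move_to:{target}"),
    )

def hold(target: str) -> Node:
    # Attack when an enemy is in range, otherwise stay in place.
    return fallback(
        leaf(lambda u: u.enemy_in_range, "attack"),
        leaf(lambda u: True, "stand"),
    )

# Hypothetical library of behaviors that a plan can refer to by name.
BEHAVIORS: Dict[str, Callable[[str], Node]] = {
    "attack_or_move": attack_or_move,
    "hold": hold,
}

def assign_behaviors(plan: List[dict], units: List[Unit]) -> Dict[int, Node]:
    """Give every unit matched by a plan step the tree that step names."""
    trees: Dict[int, Node] = {}
    for step in plan:
        tree = BEHAVIORS[step["behavior"]](step["target"])
        for unit in units:
            if unit.kind == step["unit_kind"]:
                trees[unit.uid] = tree
    return trees

if __name__ == "__main__":
    units = [Unit(0, "archer"), Unit(1, "archer", True), Unit(2, "soldier")]
    plan = [
        {"unit_kind": "archer", "behavior": "attack_or_move", "target": "north_bridge"},
        {"unit_kind": "soldier", "behavior": "hold", "target": "camp"},
    ]
    trees = assign_behaviors(plan, units)
    for unit in units:
        print(unit.uid, trees[unit.uid](unit))
```

In this sketch the plan is resolved once into per-unit trees, and each tree is then ticked against the unit's current state, which mirrors the separation between planning and execution that the caption describes.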