CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy

Abstract

We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of 15 models (8 open-weight and 7 frontier, including reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT

Paper Structure

This paper contains 66 sections, 1 theorem, 14 equations, 18 figures, 9 tables.

Key Result

Theorem 3.3

Under eqn:bps and eqn:joint_tom, let ${S_{\textrm{base}}}_i(\cdot \mid z_i^{\star}, c_i)$ denote the base speaker of director $D_i$ and let $z^{\star} = (z_1^{\star}, z_2^{\star}, z_3^{\star})$ denote the joint intention vector, where $c_i = (o_{i,t}, h_t, \{u_{j,t}\}_{j \neq i})$ is director $D_i$'

Figures (18)

  • Figure 1: CRAFT framework overview. A structure generator creates a target 3D object and three private 2D views for directors, enforcing information asymmetry. At each turn, directors produce instructions from their partial views, which a builder executes via PLACE, REMOVE, or CLARIFY actions in the CRAFT engine. The system logs task progress and evaluates communication using LLM judges for spatial grounding, mind modeling, and pragmatic sufficiency.
  • Figure 2: Director perspective views for structure_016 (25 blocks, complex tier). D1 (left wall), D2 (far wall), and D3 (right wall) each observe a fixed 2D projection across all vertical layers. The full grid minimap shows ground-truth stack heights.
  • Figure 3: Failure taxonomy over all turns across 15 director models.
  • Figure 4: LLM grader scores across three evaluation dimensions, spatial grounding (left), mind modeling (center), and pragmatic sufficiency (right), broken down by question and model group. Error bars denote $\pm1$ standard error of the mean across all structure-turn-director observations per model (from independent LLM grader runs: SG and MM $n{=}3$; PS $n{=}2$).
  • Figure 5: Oracle-prescribed vs. attempted remove rate per turn, averaged across all 20 structures (shading = gap between lines). Each subplot title shows the mean gap and final-turn task progress.
  • ...and 13 more figures
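
The information asymmetry described in Figures 1 and 2 can be illustrated with a small sketch. It is not taken from the CRAFT codebase: the height-grid encoding, the function names, and the "skyline" projection (each director sees only the tallest stack along their line of sight) are all simplifying assumptions made here for illustration. The point it shows is that two different structures can look identical to one director while another director can tell them apart, which is why the directors have to communicate.

```python
from typing import Dict, List

# Hypothetical encoding (an assumption, not the CRAFT format):
# heights[row][col] is the number of blocks stacked on that grid cell,
# with row 0 nearest the viewer and col 0 next to the left wall.
# The grid is assumed to be non-empty and rectangular.
Grid = List[List[int]]


def left_wall_view(heights: Grid) -> List[int]:
    """Skyline seen from the left wall: tallest stack in each row."""
    return [max(row) for row in heights]


def far_wall_view(heights: Grid) -> List[int]:
    """Skyline seen from the far wall: tallest stack in each column."""
    return [max(col) for col in zip(*heights)]


def right_wall_view(heights: Grid) -> List[int]:
    """Skyline seen from the right wall: the left view in mirrored order."""
    return left_wall_view(heights)[::-1]


def director_views(heights: Grid) -> Dict[str, List[int]]:
    """Private 2D views for three directors (D1, D2, D3 as in Figure 2)."""
    return {
        "D1": left_wall_view(heights),
        "D2": far_wall_view(heights),
        "D3": right_wall_view(heights),
    }


# Two different toy structures on a 2x2 grid.
structure_a: Grid = [[2, 0], [0, 1]]
structure_b: Grid = [[0, 2], [1, 0]]

views_a = director_views(structure_a)
views_b = director_views(structure_b)

# D1 cannot distinguish the two structures; D2 can.
d1_is_ambiguous = views_a["D1"] == views_b["D1"]
d2_disambiguates = views_a["D2"] != views_b["D2"]
print(views_a, views_b, d1_is_ambiguous, d2_disambiguates)
```

In this toy setting the left-wall and right-wall skylines are mirror images of each other; in a richer encoding (for example, one with per-block attributes) opposite walls would generally carry different information.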

Theorems & Definitions (5)

  • Definition 3.1: Bounded Pragmatic Speaker
  • Definition 3.2: Joint ToM Listener
  • Theorem 3.3: CRAFT as a Multi-Sender BPS
  • Definition 4.1
  • Proof