
Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance even for frontier models. Moreover, we find that thinking capability yields gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, while the best model, Gemini-3-Pro-Thinking, reaches 72%, leaving substantial room for improvement. Furthermore, human conversations grow more precise as partners align on a shared spatial understanding, whereas MLLMs keep exploring without converging, suggesting limited capacity to form and sustain a robust shared mental model throughout the dialogue. Our code and data are available at https://github.com/ankursikarwar/Cosmic.
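Concretely, the benchmark's interaction format amounts to a turn-based message exchange between an Answerer, who receives the question and one egocentric view, and a Helper, who sees the complementary view. Below is a minimal sketch of such a loop. It is illustrative only, not the released implementation: the prompts, the turn limit, the assumption that the Helper never sees the answer options, and the `query_mllm` callable (image plus text prompt in, text reply out) are all placeholders.

```python
# Hypothetical sketch of a COSMIC-style two-agent interaction loop (not the authors' code).
from typing import Callable, List, Tuple

def run_dialogue(
    query_mllm: Callable[[str, str], str],  # assumed helper: (image_path, prompt) -> model reply
    answerer_view: str,
    helper_view: str,
    question: str,
    options: List[str],
    max_turns: int = 10,
) -> str:
    """Let a static Answerer and Helper exchange messages, then return the Answerer's choice."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        # The Answerer sees the question, its own egocentric view, and the dialogue so far.
        answerer_msg = query_mllm(
            answerer_view,
            f"Question: {question}\nOptions: {options}\nDialogue: {transcript}\n"
            "Ask your partner about their view, or say DONE when ready to answer.",
        )
        transcript.append(("Answerer", answerer_msg))
        if "DONE" in answerer_msg:
            break
        # The Helper sees only its own view and the dialogue (no answer options, by assumption).
        helper_msg = query_mllm(
            helper_view,
            f"Dialogue: {transcript}\nAnswer your partner's question about your view.",
        )
        transcript.append(("Helper", helper_msg))
    # Final forced choice among the multiple-choice options.
    return query_mllm(
        answerer_view,
        f"Question: {question}\nOptions: {options}\nDialogue: {transcript}\n"
        "Reply with exactly one option.",
    )
```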

Paper Structure

This paper contains 30 sections, 20 figures, and 2 tables.

Figures (20)

  • Figure 1: Left: MLLM agents attempt to communicate and build a spatial mental model for answering questions in COSMIC. Right: Answerer and Helper agents integrate distinct egocentric views via communication to answer the question. Humans demonstrate efficient and precise strategies while MLLM agents are more verbose, inefficient, and fail to build and maintain a robust shared mental model.
  • Figure 2: Overview of the COSMIC benchmark. Each task pair shows the Answerer's view (left) and the Helper's view (right), along with the question and options posed to the Answerer.
  • Figure 3: Benchmark curation pipeline. Our pipeline involves scene generation, sampling complementary agent viewpoints, and generating questions using templates and unique object descriptions, followed by paraphrasing the questions.
  • Figure 4: COSMIC benchmark composition. Left: Scenes from our benchmark. Center Top: Distribution of room types. Center Bottom: Distribution of scene clutter (number of object instances per scene). Right Top: Object-category frequencies across the benchmark. Right Bottom: Word cloud representing the most frequent spatial and object-related terms in the dataset.
  • Figure 5: Top: Evaluation on COSMIC. Error bars denote 90% confidence intervals computed via bootstrap resampling (see the sketch after this list). Dashed lines indicate chance levels (25% for 4-choice MCQ, 50% for binary map tasks, and 30% overall). Bottom: Evaluation on COSMIC-Human.
  • ...and 15 more figures
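
Figure 5's error bars are 90% confidence intervals obtained via bootstrap resampling over questions. The following is a minimal sketch of how such intervals can be computed from per-question 0/1 correctness scores; the resample count and the percentile method are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def bootstrap_ci(correct: np.ndarray, n_resamples: int = 10_000,
                 level: float = 0.90, seed: int = 0):
    """Percentile bootstrap CI for mean accuracy over per-question 0/1 correctness scores."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    # Resample questions with replacement and record the accuracy of each resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = correct[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    return np.quantile(means, alpha), np.quantile(means, 1.0 - alpha)

# Example with synthetic scores: 1250 questions at roughly 72% accuracy.
scores = (np.random.default_rng(1).random(1250) < 0.72).astype(float)
low, high = bootstrap_ci(scores)
print(f"accuracy = {scores.mean():.3f}, 90% CI = [{low:.3f}, {high:.3f}]")
```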