Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan

Abstract

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

Paper Structure (7 sections, 6 figures, 4 tables)

This paper contains 7 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: When a human specifies an open-world metric-semantic query that refers to a location in the world, MAPG decomposes the query into subparts, represents each as a learned probability distribution, and composes them for grounding.
  • Figure 2: MAPG overview. (a) The agent observes a 3D environment and receives a natural-language spatial query. (b) Agentic spatial reasoning layer: The Orchestrator decomposes the query into anchor, relation, and metric components and grounds the anchor instance using multi-view evidence and the 3D scene graph. (c) MAPG composes probabilistic kernels, producing a continuous target-location PDF in the global frame that can be used as an actionable navigation or object selection goal.
  • Figure 3: MAPG system overview. Egocentric observations $o_t$ are fused into a 3D scene graph $\Gamma_t$ of labeled objects with poses and bounding boxes. The Orchestrator parses a language query $q$ into composable symbolic predicates $(o_r, r, d)$. The Grounding Agent resolves the referent $o_r$ against $\Gamma_t$. The Spatial Agent constructs analytic kernels $\kappa = \kappa_s \cdot \kappa_m \cdot \kappa_d$ and composes them into a goal distribution $P(x)$ over free space. The Verifier checks coherence and triggers corrective retries when needed. Finally, the Planner selects a navigation action $a_t$, closing the loop as new observations update $\Gamma_t$.
  • Figure 4: MAPG grounding for the query “Where is 2m to the right of the fridge?” Semantic grounding identifies the fridge, the metric kernel models the 2m offset, and the directional spatial kernel captures the predicate “right of.” Their analytical composition yields a planner-ready goal distribution over feasible target locations.
  • Figure 5: Distribution of query categories (left) and anchor object labels (right) in MAPG-Bench.
  • ...and 1 more figures
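
The kernel composition described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the Gaussian ring used for the metric kernel, the von Mises-style weight used for the directional kernel, the 2D grid, the parameter values, and the function names. The semantic term is reduced to a single, already resolved anchor position, and "right of" is assumed to mean the +x direction of the global frame.

```python
import math

def metric_kernel(x, y, anchor, dist, sigma=0.3):
    """Assumed form: Gaussian ring, highest where the distance to the anchor equals `dist`."""
    r = math.hypot(x - anchor[0], y - anchor[1])
    return math.exp(-0.5 * ((r - dist) / sigma) ** 2)

def directional_kernel(x, y, anchor, direction, kappa=4.0):
    """Assumed form: von Mises-style weight, highest where the bearing from the anchor equals `direction`."""
    theta = math.atan2(y - anchor[1], x - anchor[0])
    return math.exp(kappa * (math.cos(theta - direction) - 1.0))

def goal_distribution(anchor, dist, direction, is_free, cell=0.25, size=24):
    """Multiply the kernels on a size x size grid and normalize over free cells only."""
    weights = {}
    for i in range(size):
        for j in range(size):
            x, y = i * cell, j * cell
            if not is_free(x, y):
                continue  # occupied cells get zero probability
            weights[(i, j)] = (metric_kernel(x, y, anchor, dist)
                               * directional_kernel(x, y, anchor, direction))
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Hypothetical scene: a fridge at (2.0, 3.0) m and a wall occupying x > 5.0 m.
fridge = (2.0, 3.0)
pdf = goal_distribution(fridge, dist=2.0, direction=0.0,
                        is_free=lambda x, y: x <= 5.0)
best = max(pdf, key=pdf.get)  # grid cell with the highest probability
```

With these assumed settings, the most probable cell lies 2 m from the anchor along +x. A planner could take that cell as the goal or sample from `pdf` to keep alternative targets available.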