Geometrically-Constrained Agent for Spatial Reasoning

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng

TL;DR

The paper tackles the semantic-to-geometric gap in Vision-Language Models (VLMs) for spatial reasoning by introducing the Geometrically-Constrained Agent (GCA), a training-free two-stage framework that converts ambiguous queries into a formal task constraint ${\mathcal{C}}_\text{task}$ comprising a reference frame ${\mathcal{C}}_{\mathcal{R}}$ and an objective ${\mathcal{C}}_{\mathcal{O}}$. The VLM first acts as a semantic analyst to generate ${\mathcal{C}}_\text{task}$ and then as a constrained task solver that orchestrates perception and deterministic geometric computation within those bounds, aided by a toolbox and knowledge-augmented code generation. Across MMSI-Bench, MindCube-tiny, OmniSpatial, SPBench, and CV-Bench, GCA achieves state-of-the-art results, surpassing existing training-based and tool-integrated methods by about 27%, and it generalizes across multiple foundation VLMs with an average relative improvement of 37%. Although it incurs higher computational cost than end-to-end prompting, the approach offers verifiability and a transparent reasoning pathway, and it points to future work on temporal reasoning and trans-normal supervision to train more efficient end-to-end spatial VLMs.

Abstract

Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.

Paper Structure

This paper contains 37 sections, 2 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overview. (a) Semantic-Geometric Gap. The geometric details required for spatial reasoning are lost when visual information is translated into textual space, leading to flawed reasoning or unconstrained planning by the VLM. (b) Geometrically-Constrained Spatial Reasoning. We propose a formal task constraint that serves as a deterministic bridge between semantics and geometry in spatial reasoning.
  • Figure 2: Overall Paradigm of GCA. Given a spatial reasoning query, our GCA leverages a geometrically-constrained reasoning strategy centered on the formal task constraint (${\mathcal{C}}_\text{task}$). The VLM first translates the ambiguous query into this explicit ${\mathcal{C}}_\text{task}$, establishing a non-negotiable reference frame (${\mathcal{C}}_{\mathcal{R}}$) and objective (${\mathcal{C}}_{\mathcal{O}}$). Strictly constrained by ${\mathcal{C}}_\text{task}$, the VLM then orchestrates a toolbox to perform deterministic geometric computation and derive the final answer.
  • Figure 3: Reference Frame. Here, $v_{\text{sink}\rightarrow\text{oven}}$ denotes a vector calculated by "$\text{normalize}\left(\text{Centroid(oven)}-\text{Centroid(sink)}\right)$".
  • Figure 4: Ablation Study on Formalization. We compare our method against several baselines: (1) no tool integration ("Baseline (CoT-Only)"), (2) unconstrained tool integration with hints ("Tool (Prompt)") or without hints ("Tool (Uncon.)"), and (3) using a human-annotated ${\mathcal{C}}_\text{task}$ ("Oracle (Anno.)").
  • Figure 5: Ablation Study on Generalizability across Different VLMs. Our GCA achieves an average of 37% relative performance improvement across all tested foundation VLMs.
  • ...and 8 more figures
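
The Figure 3 caption defines a reference-frame direction as the normalized difference of two object centroids. The sketch below is one minimal way to compute such a vector in plain Python; it is not the paper's implementation. The helper names (`centroid`, `normalize`, `direction`) and the point coordinates are assumptions made up for illustration, and the paper's actual toolbox may represent objects differently.

```python
import math


def centroid(points):
    """Mean of a non-empty list of 3D points given as (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def normalize(v):
    """Scale a 3D vector to unit length; reject zero-length input."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / norm for c in v)


def direction(src_points, dst_points):
    """Unit vector from the centroid of src_points to that of dst_points."""
    src = centroid(src_points)
    dst = centroid(dst_points)
    return normalize(tuple(d - s for s, d in zip(src, dst)))


# Hypothetical point sets for the two objects (illustrative values only).
sink = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # centroid is (1, 0, 0)
oven = [(4.0, 4.0, 0.0), (4.0, 4.0, 0.0)]  # centroid is (4, 4, 0)

# normalize(Centroid(oven) - Centroid(sink)), as in the Figure 3 caption.
v_sink_to_oven = direction(sink, oven)
print(v_sink_to_oven)  # (0.6, 0.8, 0.0)
```

Because the vector is computed deterministically from object geometry, the same inputs always yield the same reference direction, which is the property the task constraint relies on.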