Geometrically-Constrained Agent for Spatial Reasoning
Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng
TL;DR
The paper tackles the semantic-to-geometric gap in Vision-Language Models for spatial reasoning by introducing the Geometrically-Constrained Agent (GCA), a training-free two-stage framework that converts ambiguous queries into a formal task constraint ${\mathcal{C}}_{task}$ comprising ${\mathcal{C}}_{\mathcal{R}}$ and ${\mathcal{C}}_{\mathcal{O}}$. The VLM first acts as a semantic analyst to generate ${\mathcal{C}}_{task}$ and then as a constrained task solver to orchestrate perception and deterministic geometric computation within those bounds, aided by a toolbox and knowledge-augmented code generation. Across MMSI-Bench, MindCube-tiny, OmniSpatial, SPBench, and CV-Bench, GCA achieves state-of-the-art results with an average improvement of roughly 27% over strong baselines, and it generalizes robustly across multiple foundation VLMs. While it incurs higher computational cost than end-to-end prompting, the approach offers verifiability and a transparent reasoning pathway, and it points to future work on temporal reasoning and trans-normal supervision to train more efficient end-to-end spatial VLMs.
Abstract
Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
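To make the two-stage decoupling concrete, the following is a minimal illustrative sketch, not the GCA implementation. It assumes a 2D world, a single supported objective, and hard-coded object positions standing in for perception tool outputs. The names `ReferenceFrame`, `TaskConstraint`, `analyze_query`, and `solve` are hypothetical, and the VLM call in the first stage is replaced by a fixed return value.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceFrame:
    """Hypothetical stand-in for the reference-frame part of the constraint."""
    origin: tuple[float, float]  # observer position in world coordinates
    heading_deg: float           # facing direction, counterclockwise from +x


@dataclass(frozen=True)
class TaskConstraint:
    """Hypothetical stand-in for the task constraint: reference frame plus objective."""
    reference: ReferenceFrame
    objective: str  # only "left_or_right" is supported in this sketch
    target: str     # name of the object the query asks about


def analyze_query(query: str) -> TaskConstraint:
    """Stage 1 placeholder: a VLM would produce this; here it is hard-coded."""
    return TaskConstraint(
        reference=ReferenceFrame(origin=(0.0, 0.0), heading_deg=90.0),
        objective="left_or_right",
        target="chair",
    )


def solve(constraint: TaskConstraint,
          positions: dict[str, tuple[float, float]]) -> str:
    """Stage 2 placeholder: deterministic geometry inside the fixed constraint."""
    if constraint.objective != "left_or_right":
        raise ValueError(f"unsupported objective: {constraint.objective}")
    ox, oy = constraint.reference.origin
    tx, ty = positions[constraint.target]
    heading = math.radians(constraint.reference.heading_deg)
    # Unit vector pointing to the observer's left (forward rotated by +90 degrees).
    left_x, left_y = -math.sin(heading), math.cos(heading)
    # Signed lateral offset of the target in the observer's frame.
    lateral = (tx - ox) * left_x + (ty - oy) * left_y
    if math.isclose(lateral, 0.0, abs_tol=1e-9):
        return "straight ahead or behind"
    return "left" if lateral > 0 else "right"


constraint = analyze_query("From where I stand, is the chair on my left or right?")
answer = solve(constraint, {"chair": (-1.0, 2.0)})
print(answer)  # prints "left"
```

The point of the sketch is the separation of concerns: once the reference frame and objective are fixed as data, the answer follows from a deterministic computation that can be checked independently of the model that produced the constraint.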
