Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

TL;DR

GeoThinker introduces active geometry integration for spatial reasoning in Multimodal Large Language Models by deploying Spatial-Grounded Fusion (SGF) and Importance Gating to selectively query and inject geometry conditioned on internal reasoning. By performing frame-wise cross-attention between semantic tokens and per-frame geometry cues and regulating attention with learned gating and a global scaling factor, GeoThinker achieves state-of-the-art spatial performance (VSI-Bench peak 72.6) and robust debiased results, while maintaining efficiency through spatial compression and careful layer selection. The approach transfers to challenging downstream tasks, improving embodied referring (RoboRefer) and autonomous driving planning (ReCogDrive), and its ablations confirm the additive benefits of SGF, frame-wise constraints, and gating. Overall, the work demonstrates that active, semantics-driven geometry integration is crucial for advancing spatial intelligence in next-generation MLLMs, with practical implications for real-world navigation and interaction tasks.

Abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
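
The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `frame_strict_fusion`, the use of geometry tokens as both keys and values, the additive form of the gating bias, and the scalar residual weight `alpha` are all assumptions made for illustration. It shows only the two ideas the abstract names: each frame's semantic tokens query that same frame's geometry tokens (frame-strict cross-attention), and a per-token gating score shifts the attention logits before the softmax.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def frame_strict_fusion(semantic, geometry, gate_bias, alpha):
    """Hypothetical sketch of gated, frame-strict cross-attention.

    semantic[f][i]  : feature vector of semantic token i in frame f (query)
    geometry[f][j]  : feature vector of geometry token j in frame f
                      (assumed here to serve as both key and value)
    gate_bias[f][j] : assumed additive logit bias from an importance gate
    alpha           : assumed global scalar weighting the injected geometry
    """
    fused = []
    # zip over frames: frame f queries only frame f's geometry tokens.
    for sem_f, geo_f, bias_f in zip(semantic, geometry, gate_bias):
        out_f = []
        for q in sem_f:
            scale = 1.0 / math.sqrt(len(q))
            logits = [dot(q, k) * scale + b for k, b in zip(geo_f, bias_f)]
            weights = softmax(logits)
            retrieved = [
                sum(w * v[d] for w, v in zip(weights, geo_f))
                for d in range(len(q))
            ]
            # Residual injection: semantic token plus scaled geometry.
            out_f.append([x + alpha * r for x, r in zip(q, retrieved)])
        fused.append(out_f)
    return fused


# Two frames, one semantic token each, two geometry tokens per frame.
semantic = [[[1.0, 0.0]], [[0.0, 1.0]]]
geometry = [[[0.0, 2.0], [4.0, 4.0]], [[1.0, 1.0], [3.0, 0.0]]]
# A very negative bias effectively gates out the second token of frame 0.
gate_bias = [[0.0, -1e9], [0.0, 0.0]]
fused = frame_strict_fusion(semantic, geometry, gate_bias, alpha=0.5)
print(fused[0][0])
```

In this toy setup, setting `alpha` to zero leaves the semantic tokens unchanged, and altering the geometry of one frame has no effect on the fused tokens of another frame, which is the property the frame-strict constraint is meant to enforce.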

Paper Structure

This paper contains 48 sections, 5 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Thinking with geometry through active integration. Left: (a) Passive Fusion: Conventional MLLMs indiscriminately incorporate a global stream of geometric features, which leads to significant information redundancy and semantic-texture misalignment. (b) Active Perception (GeoThinker): Our framework shifts the paradigm by empowering the model to discern and selectively retrieve spatial cues guided by its internal reasoning demands. Right: Active perception yields superior performance across diverse spatial intelligence benchmarks.
  • Figure 2: Comparison of geometry integration paradigms. (a) and (b) represent passive paradigms that indiscriminately incorporate geometric streams, often leading to semantic-geometry misalignment and redundant noise. In contrast, (c) GeoThinker shifts to active perception, empowering the MLLM to autonomously discern and selectively retrieve task-related geometric cues guided by internal reasoning.
  • Figure 3: Overview of the GeoThinker architecture. Our framework features a decoupled interaction mechanism where the VGGT is integrated via Spatial-Grounded Fusion layers. By employing Importance Gating, the model predicts a localized attention bias to dynamically modulate the injection of geometric textures. This design ensures that rich structural details are only queried when they are contextually relevant to the semantic reasoning process.
  • Figure 4: Visualization of Importance Gating Scores. Heatmaps illustrate that GeoThinker naturally learns to prioritize salient object boundaries and structural edges while suppressing non-informative regions like floors or walls.
  • Figure 5: Computational cost comparison of FLOPs and inference latency.
  • ...and 3 more figures