GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang

TL;DR

This framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient, offering a path toward more robust, efficient, and self-aware multimodal intelligence.

Abstract

Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), for which geometric information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, ignoring whether those signals are needed and adding computational overhead. In contrast to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatially aware supervised fine-tuning dataset. This dataset serves to activate the model's latent internal cues, empowering it to autonomously determine whether geometric information is necessary. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities and offering a path toward more robust, efficient, and self-aware multimodal intelligence.

Paper Structure (22 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Adaptive Geometric Reasoning with GeoSense. (Top) Existing MLLMs typically adopt a static approach to 3D geometry, either ignoring it or rigidly fusing it, which leads to confusion in general tasks or failures in spatial reasoning. (Bottom) GeoSense introduces an adaptive mechanism (Use with Sense) that requests geometric features only when necessary. As shown in the radar chart, this flexibility allows GeoSense to achieve SOTA performance across both general visual benchmarks (e.g., MMBench [liu2024mmbenchmultimodalmodelallaround], WeMath [qiao2024wemathdoeslargemultimodal]) and spatial reasoning tasks (e.g., VSI-Bench [yang2025thinking] and MindCube [yin2025spatialmentalmodelinglimited]).
  • Figure 2: Architectural Overview of GeoSense. We integrate a 3D visual geometry encoder alongside a standard 2D visual encoder, both of which are kept frozen to preserve pretrained representations. Dedicated projection layers map these features into a unified embedding space for the LLM backbone. During inference, the model dynamically makes an "Internal Sense Decision" based on the 2D visual input and the textual prompt. If the latent state triggers a geometry request (e.g., via the <vggt> token), 3D embeddings are concatenated to the sequence for a second inference pass. During training, only the projection layers and the LLM backbone are optimized.
  • Figure 3: Sample distribution of the perception dataset by training objective. Consistent refers to samples whose predictions are invariant to 3D features; these are kept as is. Strategy A and Strategy B represent data where 3D geometric cues are essential and where they should be treated as noise, respectively.
  • Figure 4: Confidence scores of the 3D trigger token across spatial and general tasks.
  • Figure 5: Case study of internal sense decision. We present representative examples demonstrating how our model adaptively determines whether to trigger 3D geometric features according to the input and task. (a, b): Rare cases where general tasks explicitly demand geometric embedding. (c): Typical activation for spatial reasoning. (d, e): Standard suppression for general visual inputs. (f): Spatial queries solved effectively without geometric triggers.
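
The two-pass inference flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only the `<vggt>` trigger token comes from the caption, while `adaptive_infer`, `InferenceResult`, `toy_generate`, the string-token representation, and the position at which the 3D embeddings are concatenated are all hypothetical. The point it shows is that the geometry encoder runs only when the first pass requests it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

TRIGGER_TOKEN = "<vggt>"  # trigger token named in the Figure 2 caption


@dataclass
class InferenceResult:
    answer: str
    used_geometry: bool


def adaptive_infer(
    generate: Callable[[Sequence[str]], str],
    encode_geometry: Callable[[], List[str]],
    tokens_2d: Sequence[str],
    prompt: Sequence[str],
) -> InferenceResult:
    """Run a 2D-only pass; re-run with 3D embeddings only if requested.

    `generate` and `encode_geometry` are hypothetical stand-ins for the
    LLM backbone and the frozen 3D encoder plus its projection layer.
    """
    first = generate([*tokens_2d, *prompt])
    if TRIGGER_TOKEN not in first:
        # The model judged the 2D cues sufficient: no geometry cost is paid.
        return InferenceResult(answer=first, used_geometry=False)

    # Geometry was requested: encode lazily, then run the second pass.
    # Placing the 3D tokens after the 2D tokens is an assumption.
    tokens_3d = encode_geometry()
    second = generate([*tokens_2d, *tokens_3d, *prompt])
    return InferenceResult(answer=second, used_geometry=True)


# Toy stand-in model (hypothetical): asks for geometry on "distance" queries.
def toy_generate(seq: Sequence[str]) -> str:
    has_geo = any(tok.startswith("geo:") for tok in seq)
    if "distance" in seq and not has_geo:
        return TRIGGER_TOKEN
    return "answer-with-3d" if has_geo else "answer-2d-only"


encoder_calls = []


def toy_encode_geometry() -> List[str]:
    encoder_calls.append(1)  # record each (expensive) encoder invocation
    return ["geo:0", "geo:1"]


general = adaptive_infer(toy_generate, toy_encode_geometry, ["img:0"], ["what", "color"])
spatial = adaptive_infer(toy_generate, toy_encode_geometry, ["img:0"], ["how", "distance"])
```

In this toy run the general query is answered in a single pass and the geometry encoder is invoked once, for the spatial query only, which is the efficiency argument the captions make for requesting geometric features only when necessary.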