
Make Geometry Matter for Spatial Reasoning

Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang

Abstract

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent work attempts to address this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

Paper Structure

This paper contains 18 sections, 15 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Geometry injection can be underutilized and even harmful for spatial reasoning. We compare three variants: w/o Geo. (removing the geometry branch), w/ Geo. (injecting geometry tokens via naive fusion with standard fine-tuning), and Ours. The plots reveal a counterintuitive yet reproducible pattern: in static scenes, w/ Geo. brings only marginal gains over w/o Geo., whereas in dynamic videos, w/ Geo. can even underperform the w/o Geo. baseline. This suggests that VLMs often fall back to appearance-driven shortcuts in 2D visual tokens and treat geometry as a dispensable side signal. These observations motivate GeoSR (Ours), which compels models to use geometry as actionable evidence and yields consistent improvements across both static and dynamic spatial reasoning.
  • Figure 2: Overview of the Geometry-Aware Framework for spatial reasoning. It augments a VLM with an additional geometry branch. A pretrained geometric tokenizer extracts geometry tokens $\boldsymbol{F}_{\mathrm{G}}$ from the input video, which are fused with standard vision tokens $\boldsymbol{F}_{\mathrm{V}}$ by a fusion module to form $\boldsymbol{F}$. The VLM answers the query based on the fused evidence together with the text tokens $\boldsymbol{F}_{\mathrm{P}}$. Snow and flame icons denote frozen and trainable components, respectively. This paradigm serves as our baseline. Our key observation is that naive fusion under standard fine-tuning can leave $\boldsymbol{F}_{\mathrm{G}}$ underutilized, which motivates GeoSR to make geometry really matter for spatial reasoning.
  • Figure 3: Overview of the proposed strategies in GeoSR. (a) Geometry-Unleashing Masking suppresses appearance shortcuts during training by masking a subset of vision tokens $\boldsymbol{F}_{\mathrm{V}}$. For static settings, the mask is sampled randomly. For dynamic settings, bottleneck tokens $\boldsymbol{B}$ first attend to text tokens $\boldsymbol{F}_{\mathrm{P}}$ to obtain $\boldsymbol{F}_{\mathrm{B}}$, which then attend to geometry tokens $\boldsymbol{F}_{\mathrm{G}}$ to produce question-relevant geometry evidence $\boldsymbol{Z}_{\mathrm{G}}$ and a relevance score used for TopK masking. (b) Geometry-Guided Fusion redistributes the compact evidence $\boldsymbol{Z}_{\mathrm{G}}$ back to token-level geometry features (if applicable) and applies a learned gate $\boldsymbol{\alpha}$ to control the contributions of masked vision features $\tilde{\boldsymbol{F}}_{\mathrm{V}}$ and geometry features $\tilde{\boldsymbol{F}}_{\mathrm{G}}$, producing fused tokens $\boldsymbol{F}$ for the VLM backbone.
  • Figure 4: Visualization of the static spatial reasoning results on VSI-Bench [yang2025thinking].
  • Figure 5: Visualization of the dynamic spatial reasoning results on DSR-Bench [zhou2026learning].
  • ...and 2 more figures
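The two components described in Figure 3 can be sketched in a few lines. The following is a minimal, illustrative numpy sketch, not the authors' implementation: the random-mask ratio, the gate parameterization (a sigmoid over a linear projection with hypothetical weights `W_g`, `b_g`), and the per-token blend are all assumptions chosen to mirror the caption's description of masking vision tokens $\boldsymbol{F}_{\mathrm{V}}$ and gating them against geometry tokens $\boldsymbol{F}_{\mathrm{G}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def geometry_unleashing_mask(F_V, mask_ratio=0.5, rng=rng):
    """Static-scene variant: randomly zero a fraction of 2D vision tokens
    so the model cannot rely solely on appearance shortcuts."""
    n_tokens = F_V.shape[0]
    n_mask = int(mask_ratio * n_tokens)
    idx = rng.choice(n_tokens, size=n_mask, replace=False)
    F_V_tilde = F_V.copy()
    F_V_tilde[idx] = 0.0  # masked tokens carry no 2D evidence
    return F_V_tilde

def geometry_guided_fusion(F_V_tilde, F_G, W_g, b_g):
    """Per-token gate alpha in (0, 1) controls the blend between masked
    vision features and geometry features (hypothetical parameterization)."""
    gate_in = np.concatenate([F_V_tilde, F_G], axis=-1)   # (N, 2D)
    alpha = sigmoid(gate_in @ W_g + b_g)                  # (N, 1)
    return alpha * F_G + (1.0 - alpha) * F_V_tilde        # fused tokens F

# Toy example: 8 tokens of dimension 4.
N, D = 8, 4
F_V = rng.normal(size=(N, D))
F_G = rng.normal(size=(N, D))
W_g = rng.normal(size=(2 * D, 1)) * 0.1  # illustrative gate weights
b_g = np.zeros(1)

F_V_tilde = geometry_unleashing_mask(F_V, mask_ratio=0.5)
F = geometry_guided_fusion(F_V_tilde, F_G, W_g, b_g)
```

In the dynamic-video setting the caption instead derives a question-conditioned relevance score (via bottleneck tokens attending to text and geometry) and uses TopK masking; the random mask above corresponds only to the static case.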