SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

Ruosen Zhao; Zhikang Zhang; Jialei Xu; Jiahao Chang; Dong Chen; Lingyun Li; Weijian Sun; Zizhuang Wei

TL;DR

SpaceMind addresses 3D spatial reasoning in vision-language models from RGB inputs alone. It introduces a Camera-Guided Modality Fusion (CGMF) module that treats the camera representation as an explicit guiding modality and fuses it with geometry-aware spatial tokens before the language model. Using a dual-encoder setup (InternViT and VGGT) and a SwiGLU-based camera-gated fusion, the approach achieves state-of-the-art results on VSI-Bench and SQA3D and strong performance on SPBench, demonstrating robust spatial grounding without explicit 3D sensors. The work highlights the importance of separating camera/viewpoint information from scene geometry in multimodal fusion, offering a practical and scalable path to spatial reasoning in RGB-only VLMs.

Abstract

Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning tasks such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or augment RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind replaces shallow fusion with a lightweight Camera-Guided Modality Fusion module placed before the language model. The module applies camera-conditioned biasing to spatial tokens, assigns them query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D, and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
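
The three operations named in the abstract (camera-conditioned biasing, query-independent weighting, and camera gating) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layer layout, so the cross-attention form, the way importance weights enter the attention logits, the SwiGLU-style gate, and every name here (`camera_guided_fusion`, `init_params`, `W_bias`, `w_imp`, `W_gate`, and so on) are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation, the nonlinearity used inside SwiGLU.
    return x / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d, hidden, rng):
    # Hypothetical parameter set; names and shapes are illustrative only.
    s = 1.0 / np.sqrt(d)
    return {
        "W_bias": rng.normal(0, s, (d, d)),       # camera -> bias on spatial tokens
        "w_imp": rng.normal(0, s, (d,)),          # scores each spatial token
        "W_q": rng.normal(0, s, (d, d)),
        "W_k": rng.normal(0, s, (d, d)),
        "W_v": rng.normal(0, s, (d, d)),
        "W_gate": rng.normal(0, s, (d, hidden)),  # camera -> gate
        "W_up": rng.normal(0, s, (d, hidden)),
        "W_down": rng.normal(0, 1.0 / np.sqrt(hidden), (hidden, d)),
    }

def camera_guided_fusion(visual, spatial, camera, p):
    """visual: (N, d) 2D tokens; spatial: (M, d) geometry tokens; camera: (d,)."""
    d = visual.shape[-1]

    # 1) Camera-conditioned bias: shift every spatial token by a camera projection.
    biased = spatial + camera @ p["W_bias"]

    # 2) Query-independent importance: one weight per spatial token,
    #    computed without looking at the visual queries.
    importance = softmax(biased @ p["w_imp"])  # shape (M,), sums to 1

    # 3) Assumed fusion step: visual tokens attend to spatial tokens, with the
    #    log-importance added to the logits as a shared prior over keys.
    logits = (visual @ p["W_q"]) @ (biased @ p["W_k"]).T / np.sqrt(d)
    attn = softmax(logits + np.log(importance + 1e-9), axis=-1)
    fused = visual + attn @ (biased @ p["W_v"])

    # 4) SwiGLU-style gate driven by the camera embedding (assumed form).
    gate = silu(camera @ p["W_gate"])              # shape (hidden,)
    out = ((fused @ p["W_up"]) * gate) @ p["W_down"]
    return out, importance

rng = np.random.default_rng(0)
d, hidden, N, M = 8, 16, 4, 6
params = init_params(d, hidden, rng)
out, importance = camera_guided_fusion(
    rng.normal(size=(N, d)), rng.normal(size=(M, d)), rng.normal(size=(d,)), params
)
print(out.shape)  # (4, 8)
```

In this sketch the camera embedding enters twice, once as an additive bias on the geometry tokens and once as a multiplicative gate on the fused output, which mirrors the abstract's framing of the camera as a guiding modality rather than passive metadata. A gate that evaluates to zero suppresses the fused signal entirely.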
