Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

Paper Structure (26 sections, 14 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Loc3R-VLM equips 2D VLMs with advanced 3D spatial understanding capabilities from video. Inspired by human cognition, it builds an internal cognitive map of the global environment while explicitly modeling an agent's position and orientation. By jointly capturing global layout and egocentric state, the model excels at two core tasks: language-driven localization and viewpoint-aware 3D reasoning.
  • Figure 2: Overview of Loc3R-VLM. Our framework takes a monocular video as input and augments the vision token sequence with per-frame latent camera pose priors extracted from the 3D foundation model CUT3R (Wang et al., 2025). The model is jointly trained using two spatial objectives: (1) layout reconstruction, which grounds vision patch tokens into a bird's-eye-view (BEV) space to capture global scene structure, and (2) situation modeling, which utilizes dedicated localization query tokens to localize an agent from a situation description. During answer generation, the model leverages the inferred layout and location to perform viewpoint-aware 3D reasoning.
  • Figure 3: Spatial Supervision Framework introduces complementary training signals. For the layout reconstruction objective, the model learns to ground each vision patch token onto its corresponding BEV coordinate in a cognitive map to capture global scene structure. For localization, dedicated localization tokens explicitly model the agent's position and orientation. The framework is trained end-to-end using a joint objective of layout, localization, and language losses.
  • Figure 4: Qualitative Results for language-based localization and situated QA on SQA3D (Ma et al., 2022). Loc3R-VLM accurately grounds the described situations (blue: prediction, green: ground truth) and provides the correct viewpoint-dependent answer. Meshes are shown for visualization only and are not used by the model.
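
The Figure 3 caption describes a joint objective of layout, localization, and language losses. The following is a minimal sketch of how three such terms could be combined. It is not the authors' implementation: the squared-error layout term, the cosine-based orientation term, the weights `w_layout` and `w_loc`, and all function names are assumptions made purely for illustration.

```python
import math

def layout_loss(pred_bev, gt_bev):
    """Assumed form: mean squared distance between predicted and
    ground-truth BEV (x, y) coordinates, one pair per vision patch token."""
    if not pred_bev or len(pred_bev) != len(gt_bev):
        raise ValueError("pred_bev and gt_bev must be non-empty and equal length")
    total = 0.0
    for (px, py), (gx, gy) in zip(pred_bev, gt_bev):
        total += (px - gx) ** 2 + (py - gy) ** 2
    return total / len(pred_bev)

def localization_loss(pred_pos, pred_yaw, gt_pos, gt_yaw):
    """Assumed form: squared position error plus an orientation penalty
    that is 0 when the yaw angles (radians) agree and 2 when opposite."""
    pos_err = (pred_pos[0] - gt_pos[0]) ** 2 + (pred_pos[1] - gt_pos[1]) ** 2
    ang_err = 1.0 - math.cos(pred_yaw - gt_yaw)
    return pos_err + ang_err

def joint_loss(language_loss, layout, localization, w_layout=1.0, w_loc=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return language_loss + w_layout * layout + w_loc * localization

# Toy usage with made-up numbers.
lay = layout_loss([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 2.0)])
loc = localization_loss((0.0, 0.0), 0.0, (3.0, 4.0), 0.0)
total = joint_loss(2.0, lay, loc, w_layout=2.0, w_loc=0.1)
```

A weighted sum is used here only because it is the simplest way to train several objectives end-to-end; the paper's actual loss definitions and weighting may differ.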