VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu

TL;DR

This work proposes VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for text-to-point-cloud (T2P) localization by transforming point clouds into bird's-eye-view images and scene graphs that jointly encode geometric and semantic context.

Abstract

Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate that VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
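
The conversion of a point cloud into a BEV image and scene graph nodes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the point format (x, y, z, label, instance_id), the function names rasterize_bev and build_scene_graph_nodes, the highest-point-wins rule for each cell, and the use of the 2D centroid as a node position are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def rasterize_bev(points, x_range, y_range, cell_size):
    """Project labeled 3D points onto a top-down semantic grid.

    Assumption: each cell keeps the label of its highest point.
    """
    width = round((x_range[1] - x_range[0]) / cell_size)
    height = round((y_range[1] - y_range[0]) / cell_size)
    labels = [[None] * width for _ in range(height)]
    top_z = [[-math.inf] * width for _ in range(height)]
    for x, y, z, label, _instance_id in points:
        col = math.floor((x - x_range[0]) / cell_size)
        row = math.floor((y - y_range[0]) / cell_size)
        # Skip points outside the map extent; keep the highest point per cell.
        if 0 <= row < height and 0 <= col < width and z > top_z[row][col]:
            top_z[row][col] = z
            labels[row][col] = label
    return labels


def build_scene_graph_nodes(points):
    """Create one node per instance with a semantic label and a 2D centroid.

    Assumption: a node's spatial information is its BEV centroid.
    """
    groups = defaultdict(list)
    for x, y, _z, label, instance_id in points:
        groups[instance_id].append((x, y, label))
    nodes = []
    for instance_id, members in sorted(groups.items()):
        cx = sum(m[0] for m in members) / len(members)
        cy = sum(m[1] for m in members) / len(members)
        nodes.append({"id": instance_id, "label": members[0][2], "position": (cx, cy)})
    return nodes
```

In a setup like the one the abstract describes, the grid would be rendered as an image for the visual input, while the node list would be serialized into the text input alongside the query.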

Paper Structure

This paper contains 28 sections, 4 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: (a) illustrates the human-like logic behind text-to-point cloud localization, where spatial descriptions are used to infer the target position. (b) and (c) show the architectures of a typical method, Text2Loc [xia2024text2loc], and our proposed VLM-Loc, respectively.
  • Figure 2: Overview of VLM-Loc. In the data generation stage, the point cloud map is converted into a BEV image and a scene graph, where each node encodes semantic and spatial information. During training, the BEV image is used as the visual input, and the text input includes the scene graph, system prompt, and text query. These inputs are fed into a VLM for fine-tuning, enabling it to perform partial node assignment and position estimation in an autoregressive manner.
  • Figure 3: Illustration of the node assignment process. Partial node assignment (PNA) determines whether a textual object is groundable by comparing the distance between points A and B with the threshold $\tau$.
  • Figure 4: Relationship between localization error and the number of correctly assigned nodes on the CityLoc-K test set. More correct node assignments correspond to lower localization errors.
  • Figure 5: Qualitative results of VLM-Loc and baseline methods on CityLoc-K. Each example visualizes the predicted and GT positions on colorized BEV maps rendered with semantic labels. The red and black circles denote the GT and predicted positions, respectively. The localization error is shown below each image, and green and red borders indicate localization errors below and above 5 m, respectively.
  • ...and 5 more figures
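
The thresholding idea in the Figure 3 caption can be illustrated with a short sketch. The caption does not say what points A and B are, so the code below makes its own assumptions: A is taken to be the position of an object mentioned in the text, B the nearest scene graph node with the same label, and an object counts as groundable when their distance is at most $\tau$. The function name partial_node_assignment and the dictionary layout are hypothetical, and in VLM-Loc the assignment is produced by the fine-tuned VLM rather than by a rule like this one.

```python
import math


def partial_node_assignment(text_objects, nodes, tau):
    """Assign each textual object to a node id, or None if not groundable.

    Assumption: an object is groundable when the nearest same-label node
    lies within distance tau of the object's position.
    """
    assignments = []
    for obj in text_objects:
        best_id, best_dist = None, math.inf
        for node in nodes:
            if node["label"] != obj["label"]:
                continue
            dist = math.dist(obj["position"], node["position"])
            if dist < best_dist:
                best_id, best_dist = node["id"], dist
        # Objects with no matching node keep best_dist = inf and stay unassigned.
        assignments.append(best_id if best_dist <= tau else None)
    return assignments


nodes = [
    {"id": 0, "label": "tree", "position": (0.0, 0.0)},
    {"id": 1, "label": "building", "position": (10.0, 0.0)},
]
text_objects = [
    {"label": "tree", "position": (3.0, 4.0)},       # 5 m from node 0
    {"label": "building", "position": (30.0, 0.0)},  # 20 m from node 1
    {"label": "pole", "position": (0.0, 0.0)},       # no node with this label
]
result = partial_node_assignment(text_objects, nodes, tau=6.0)
```

The assignment is "partial" in the sense that some textual objects are deliberately left without a node, which keeps cues that have no counterpart in the map from being forced onto one.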