Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers

TL;DR

Text2Loc++ tackles the challenge of localizing 3D point-cloud submaps from natural language by introducing a coarse-to-fine two-stage framework. The global stage uses a Hierarchical Transformer with a frozen language model to capture sentence- and cross-sentence semantics, paired with an attention-based point-cloud encoder, while the fine stage is made matching-free through Prototype-based Map Cloning and Cascaded Cross-Attention. To boost cross-modal alignment and robustness, the paper introduces Masked Instance Training, Modality-aware Hierarchical Contrastive Learning, text distillation, and LoRA-based language tuning, and it demonstrates strong generalization across color and non-color LiDAR data and diverse urban environments. Extensive experiments on KITTI360Pose and a new multi-city dataset show substantial improvements over state-of-the-art methods and reveal robust cross-domain performance, with public code and data to follow.

Abstract

We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
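The abstract names a cross-modal term among the MHCL losses but does not give its form here. As a rough illustration only, the sketch below assumes a generic symmetric InfoNCE-style loss over paired text and submap embeddings; the function names, the temperature value, and the loss form itself are assumptions for illustration, not the paper's definition, and the submap-, text-, and instance-level terms are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def row_nll(logits):
    """Mean negative log-softmax of the diagonal (matched) entry of each row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(text_embs, submap_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss; row i of each input is assumed to be a matched pair.

    The temperature is a hypothetical value chosen for illustration.
    """
    text = [l2_normalize(t) for t in text_embs]
    maps = [l2_normalize(s) for s in submap_embs]
    # logits[i][j]: scaled similarity between text i and submap j
    logits = [[dot(t, s) / temperature for s in maps] for t in text]
    logits_t = [list(col) for col in zip(*logits)]  # submap-to-text direction
    return 0.5 * (row_nll(logits) + row_nll(logits_t))
```

With toy embeddings, correctly paired inputs should give a lower loss than the same inputs with the pairing swapped, which is the behaviour a contrastive objective of this kind is meant to reward.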

Paper Structure

This paper contains 23 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: (Left) We present Text2Loc++, a framework developed for cross-city and cross-country position localization based on textual descriptions of varying complexity. Given a 3D point cloud representing the surrounding environment and a textual query describing a location from different urban or regional contexts, Text2Loc++ identifies the most probable position corresponding to the description within the map. The model demonstrates strong generalization across diverse language inputs and heterogeneous point cloud data. (Right) Localization results on the KITTI360Pose test set show that the proposed Text2Loc++ consistently outperforms existing baselines across all top-k retrieval thresholds, achieving up to 15% higher accuracy in localizing text queries within a 5 m error range.
  • Figure 2: The proposed Text2Loc++ architecture. It consists of two tandem modules: Global place recognition and Fine localization. Global place recognition. Given a text-based position description, we first identify a set of coarse candidate locations, "submaps," potentially containing the target position. This is achieved by retrieving the top-k nearest submaps from a previously constructed database of submaps using our novel text-to-submap retrieval model. Fine localization. We then refine the center coordinates of the retrieved submaps via our designed matching-free position estimation module, which adjusts the target location to increase accuracy.
  • Figure 3: Sample point clouds from different datasets. We present visualizations of the point clouds from the four constructed benchmark datasets. Subfigures CARLA - Paris_CARLA, Paris - Paris_CARLA, and Toronto depict datasets with color information, whereas the other subfigures correspond to datasets without color attributes.
  • Figure 4: Qualitative localization results on the KITTI360Pose dataset. In global place recognition, the numbers in the top-3 retrieved submaps represent center distances between the retrieved submaps and the ground truth. Green boxes indicate positive submaps containing the target location, while red boxes signify negative submaps. For fine localization, red and black dots represent the ground truth and predicted target locations, with the red number indicating the distance between them. (a), (b), and (c) use simple, moderate, and complex text descriptions, respectively.
  • Figure 5: Robustness analysis comparing different metrics.
  • ...and 1 more figure
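
The captions for Figures 1 and 2 describe retrieving the top-k nearest submaps for a text query, adjusting each retrieved submap center, and counting a query as localized when a prediction falls within an error range such as 5 m. The following is a minimal sketch of that flow under stated assumptions: cosine similarity as the retrieval score, precomputed 2D offsets standing in for the fine localization module's output, and hypothetical names throughout (`retrieve_top_k`, `refine`, `recall_at_k`, and the query dictionary layout). None of these come from the paper's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def retrieve_top_k(text_emb, submap_embs, k):
    """Return indices of the k submaps most similar to the text embedding."""
    scores = [(cosine(text_emb, emb), i) for i, emb in enumerate(submap_embs)]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scores[:k]]


def refine(center, offset):
    """Shift a submap center by an offset (a stand-in for the fine stage's prediction)."""
    return (center[0] + offset[0], center[1] + offset[1])


def recall_at_k(queries, submap_embs, centers, k, eps=5.0):
    """Fraction of queries with at least one top-k prediction within eps meters.

    Each query is a dict with hypothetical keys:
      "text_emb": the query embedding,
      "offsets":  a mapping from submap index to a precomputed (dx, dy) offset,
      "gt":       the ground-truth (x, y) position.
    """
    hits = 0
    for q in queries:
        idxs = retrieve_top_k(q["text_emb"], submap_embs, k)
        preds = [refine(centers[i], q["offsets"][i]) for i in idxs]
        hits += any(math.dist(p, q["gt"]) <= eps for p in preds)
    return hits / len(queries)
```

Raising k can only keep or increase this recall, since every prediction considered at a smaller k is still considered at a larger one; this is why results of this kind are typically reported across several top-k thresholds.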