Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Zifeng Ding, João F. Henriques, Daniel Cremers

TL;DR

This work introduces a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text and proposes a novel matching-free fine localization method, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods.

Abstract

We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among the textual hints are captured by a hierarchical transformer with max-pooling (HTM), while a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at https://yan-xia.github.io/projects/text2loc/.
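
The coarse-to-fine pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`retrieve_top_k`, `refine_position`), the toy 2D embeddings, the use of cosine similarity, and the hard-coded offset are all assumptions made for illustration. In Text2Loc the embeddings and the offset would come from learned networks.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def retrieve_top_k(text_emb, submap_embs, k):
    """Coarse stage (hypothetical helper): indices of the k submaps whose
    embeddings are most similar to the text query embedding."""
    sims = l2_normalize(submap_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims)[:k]

def refine_position(submap_center, predicted_offset):
    """Fine stage (hypothetical helper): shift the submap center by an offset.
    The offset is assumed to be regressed by a model; no text-instance
    matching is involved in this step."""
    return np.asarray(submap_center, dtype=float) + np.asarray(predicted_offset, dtype=float)

# Toy database: three submaps with made-up embeddings and centers (in meters).
submap_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
submap_centers = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 30.0]])

# Made-up embedding of a text query.
text_emb = np.array([0.9, 0.1])

top = retrieve_top_k(text_emb, submap_embs, k=2)

# Assumed offset for the best candidate; a real system would predict it.
offset = np.array([2.5, -1.0])
position = refine_position(submap_centers[top[0]], offset)
```

With these toy values, `top` ranks submap 0 first and submap 2 second, and `position` is the center of submap 0 shifted by the assumed offset.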

Paper Structure (28 sections, 3 equations, 7 figures, 8 tables)

This paper contains 28 sections, 3 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: (Left) We introduce Text2Loc, a solution designed for city-scale position localization using textual descriptions. When provided with a point cloud representing the surroundings and a textual query describing a position, Text2Loc determines the most probable location of that described position within the map. (Right) Localization performance on the KITTI360Pose test set. The proposed Text2Loc achieves consistently better performance across all top retrieval numbers. Notably, it outperforms the best baseline by up to 2 times when localizing text queries to within 5 m.
  • Figure 2: The proposed Text2Loc architecture. It consists of two tandem modules: global place recognition and fine localization. Global place recognition: given a text-based position description, we first identify a set of coarse candidate locations ("submaps") that potentially contain the target position. This is achieved by retrieving the top-k nearest submaps from a previously constructed database of submaps using our novel text-to-submap retrieval model. Fine localization: we then refine the center coordinates of the retrieved submaps via our matching-free position estimation module, which adjusts the target location to increase accuracy.
  • Figure 3: Qualitative localization results on the KITTI360Pose dataset. In global place recognition, the numbers on the top-3 retrieved submaps represent the center distances between the retrieved submaps and the ground truth. Green boxes indicate positive submaps containing the target location, while red boxes signify negative submaps. For fine localization, red and black dots represent the ground truth and predicted target locations, with the red number indicating the distance between them.
  • Figure 4: t-SNE visualization for the global place recognition.
  • Figure 5: Robustness analysis of Text2Loc on the KITTI360Pose benchmark. We present the top-3 retrieved submaps in global place recognition and the final predicted location for both the original query text descriptions and the modified queries (in red).
  • ...and 2 more figures
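
The captions above refer to localization performance at a given number of retrieved candidates and a distance threshold (for example, 5 m). A minimal sketch of such a recall metric follows. The function name `localization_recall`, the array shapes, and the "at least one of the top-k candidates within the threshold" rule are assumptions for illustration. The exact KITTI360Pose evaluation protocol is not given here and may differ.

```python
import numpy as np

def localization_recall(pred_positions, gt_positions, threshold_m):
    """Fraction of queries for which at least one of the k candidate
    positions lies within threshold_m of the ground truth (assumed rule).

    pred_positions: array of shape (num_queries, k, 2), in meters
    gt_positions:   array of shape (num_queries, 2), in meters
    """
    pred = np.asarray(pred_positions, dtype=float)
    gt = np.asarray(gt_positions, dtype=float)
    # Euclidean distance from every candidate to its query's ground truth.
    dists = np.linalg.norm(pred - gt[:, None, :], axis=-1)  # (num_queries, k)
    hits = (dists <= threshold_m).any(axis=1)
    return float(hits.mean())

# Toy example: two queries, two candidate positions each (made-up values).
pred = np.array([
    [[1.0, 1.0], [40.0, 0.0]],   # query 0: best candidate is about 1.41 m away
    [[20.0, 0.0], [0.0, 12.0]],  # query 1: best candidate is 12 m away
])
gt = np.array([[0.0, 0.0], [0.0, 0.0]])

recall_5m = localization_recall(pred, gt, threshold_m=5.0)
recall_15m = localization_recall(pred, gt, threshold_m=15.0)
```

In this toy case only the first query is localized within 5 m, so `recall_5m` is 0.5, while both queries fall within 15 m, so `recall_15m` is 1.0.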