Instance-free Text to Point Cloud Localization with Relative Position Awareness

Lichao Wang, Zhihao Yuan, Jinke Ren, Shuguang Cui, Zhen Li

TL;DR

This work addresses outdoor, city-scale text-to-point-cloud localization without relying on ground-truth instance inputs. It introduces IFRP-T2P, which combines an instance query extractor with two relative-position-aware attention modules (RowColRPA in coarse retrieval and RPCA in fine fusion) to exploit spatial relations among potential instances. On KITTI360Pose, IFRP-T2P achieves localization performance competitive with state-of-the-art methods that require instance inputs, while demonstrating improved robustness to point-cloud sparsity and avoiding error-prone instance segmentation pipelines. The approach strengthens cross-modal fusion for practical robot-human collaboration by leveraging geometric cues in both the retrieval and regression stages.

Abstract

Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration. It seeks to localize a position in a city-scale point cloud scene based on a few natural language instructions. In this paper, we address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances. Our proposed model follows a two-stage pipeline, comprising a coarse stage for text-cell retrieval and a fine stage for position estimation. In both stages, we introduce an instance query extractor, in which the cells are encoded by a 3D sparse convolution U-Net to generate multi-scale point cloud features, and a set of queries iteratively attends to these features to represent instances. In the coarse stage, a row-column relative position-aware self-attention (RowColRPA) module is designed to capture the spatial relations among the instance queries. In the fine stage, a multi-modal relative position-aware cross-attention (RPCA) module is developed to fuse the text and point cloud features along with spatial relations to improve fine position estimation. Experimental results on the KITTI360Pose dataset demonstrate that our model achieves performance competitive with state-of-the-art models without taking ground-truth instances as input.
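
The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity ranking, the toy 2-D embeddings, the cell names, and the fixed offset are all assumptions chosen for illustration, and the helpers `retrieve_top_k` and `refine_position` are hypothetical. In the actual model, the embeddings would come from the text encoder and the instance query extractor, and the offset from the fine-stage regressor.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(text_emb, cell_db, k):
    """Coarse stage (assumed form): rank cells by similarity to the text
    embedding and return the ids of the k best matches."""
    ranked = sorted(
        cell_db,
        key=lambda cell_id: cosine(text_emb, cell_db[cell_id]["emb"]),
        reverse=True,
    )
    return ranked[:k]


def refine_position(cell_center, offset):
    """Fine stage (assumed form): shift a cell center by a predicted offset."""
    return tuple(c + o for c, o in zip(cell_center, offset))


# Toy cell database: embeddings and centers are made-up values.
cell_db = {
    "cell_a": {"emb": [1.0, 0.0], "center": (0.0, 0.0)},
    "cell_b": {"emb": [0.0, 1.0], "center": (30.0, 0.0)},
    "cell_c": {"emb": [0.7, 0.7], "center": (0.0, 30.0)},
}
text_emb = [0.9, 0.1]  # stand-in for an encoded text query

candidates = retrieve_top_k(text_emb, cell_db, k=2)

# Hypothetical offset that a fine-stage regressor might output.
predicted_offset = (2.5, -1.0)
positions = [
    refine_position(cell_db[cell_id]["center"], predicted_offset)
    for cell_id in candidates
]
print(candidates, positions)
```

The split keeps the expensive cross-modal fusion confined to the few retrieved cells, which is the usual motivation for a coarse-to-fine design.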

Paper Structure (25 sections, 16 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Overview of the text-to-point-cloud localization task. Unlike the visual-based place recognition task, the objective of our work is to identify a specific position within a city-scale point cloud scene based on a text query.
  • Figure 2: Illustration of our coarse-to-fine pipeline. Our approach processes the raw point clouds of cells directly and uses queries to represent potential instances. First, the coarse stage selects candidate cells that may contain the target by retrieving the top-$k$ cells from a pre-built cell database. The fine stage then fuses the multi-modal features and refines the center coordinates of the selected cells.
  • Figure 3: Illustration of (a) instance query extraction, (b) mask module, (c) transformer decoder, (d) query enhance module, and (e) hint encoder.
  • Figure 4: Comparison of our row-column relative position aware self-attention (RowColRPA), the relation-enhanced self-attention (REA) in RET, and vanilla self-attention.
  • Figure 5: Illustration of the relative position-aware multi-modal fusion module. The relative-position-aware cross attention (RPCA) merges potential instance features with text keys and values, infusing semantic and spatial relation information with text embeddings.
  • ...and 5 more figures
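
The contrast drawn in the Figure 4 caption between vanilla self-attention and a relative-position-aware variant can be sketched generically. The exact RowColRPA formulation is not given in this summary, so the code below is only an assumed stand-in: it subtracts a distance penalty from the attention logits, where `distance_penalty`, the 2-D instance centers, and the function names are all hypothetical. Setting the penalty to zero recovers vanilla scaled dot-product attention weights.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_offsets(centers):
    """Pairwise (dx, dy) offsets between instance centers."""
    return [
        [(cj[0] - ci[0], cj[1] - ci[1]) for cj in centers]
        for ci in centers
    ]


def attention_weights(queries, centers, distance_penalty=0.0):
    """Self-attention weights over instance queries.

    The logit is the scaled dot product minus distance_penalty times the
    Euclidean distance between the two instance centers. This bias is an
    illustrative assumption, not the paper's RowColRPA definition.
    """
    dim = len(queries[0])
    rel = relative_offsets(centers)
    weights = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(queries):
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            distance = math.hypot(rel[i][j][0], rel[i][j][1])
            logits.append(content - distance_penalty * distance)
        weights.append(softmax(logits))
    return weights


# Three identical toy queries at different (made-up) positions.
queries = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]

vanilla = attention_weights(queries, centers)
biased = attention_weights(queries, centers, distance_penalty=0.5)
```

With identical query content, the vanilla weights are uniform, while the biased weights favor nearby instances, which shows how a positional term lets attention distinguish instances that look alike but sit in different places.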