Fast 3D point clouds retrieval for Large-scale 3D Place Recognition

Chahine-Nicolas Zede, Laurent Carrafa, Valérie Gouet-Brunet

TL;DR

This work tackles scalable retrieval for 3D point clouds in large-scale LiDAR-based place recognition by adapting the Differentiable Search Index (DSI) to 3D data. It maps 3D point-cloud descriptors to 1D docids using a Vision Transformer-based captioning step, augmented with positional and semantic encoding, achieving near $O(1)$ retrieval. The proposed DSI-3D framework introduces new docid representations, notably Positional Structured identifiers and Hilbert-curve indexing, and demonstrates retrieval performance competitive with state-of-the-art methods while greatly reducing query time on KITTI datasets. The results indicate that Hilbert-based indexing offers the best trade-off between retrieval quality and speed, highlighting the method’s potential for real-time, large-scale 3D place recognition.
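To make the idea of Hilbert-curve indexing concrete, here is a minimal sketch of how a 2D position could be turned into a 1D identifier that preserves spatial locality. It uses the standard iterative Hilbert mapping on a square grid. The function names, the 2D simplification, the grid extent, the grid order, and the zero-padded decimal docid format are all assumptions made for illustration; this section does not specify how DSI-3D actually builds its identifiers.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map a grid cell (x, y) to its position along a Hilbert curve.

    The grid has 2**order cells per side. Standard iterative algorithm.
    """
    n = 1 << order
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell coordinates must lie inside the grid")
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def position_to_docid(x_m: float, y_m: float,
                      origin=(0.0, 0.0), extent_m: float = 1024.0,
                      order: int = 8) -> str:
    """Hypothetical helper: quantize a metric position and emit a docid string.

    origin, extent_m, and order are illustrative defaults, not values
    taken from the paper.
    """
    n = 1 << order
    cell = extent_m / n
    # Clamp to the grid so positions on or past the border stay valid.
    gx = min(n - 1, max(0, int((x_m - origin[0]) / cell)))
    gy = min(n - 1, max(0, int((y_m - origin[1]) / cell)))
    width = len(str(n * n - 1))  # fixed-length identifier
    return str(hilbert_index(order, gx, gy)).zfill(width)
```

Cells that are consecutive along the curve are always neighbours in the grid, so docids that are close numerically tend to describe places that are close spatially, which is the property that makes such identifiers attractive for place recognition.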

Abstract

Retrieval in 3D point clouds is a challenging task that consists of finding the point clouds most similar to a given query within a reference set of 3D point clouds. Current methods compare point cloud descriptors to identify similar ones. Because this comparison step is costly, we focus on accelerating retrieval by adapting the Differentiable Search Index (DSI), a transformer-based approach originally designed for text information retrieval, to 3D point cloud retrieval. Our approach generates 1D identifiers from the point cloud descriptors, enabling direct retrieval in constant time. To adapt DSI to 3D data, we integrate Vision Transformers to map descriptors to these identifiers while incorporating positional and semantic encoding. The approach is evaluated for place recognition on a public benchmark, comparing its retrieval capabilities against state-of-the-art methods in terms of both the quality of the returned point clouds and retrieval speed.

Paper Structure

This paper contains 19 sections, 6 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Differentiable Search Index pipeline (Tay et al., 2022) adapted for 3D data retrieval using the GIT architecture (Wang et al., 2022). (a) Labeling step: for each database entry (red point cloud), the text decoder learns to generate its label from the encoded point cloud. (b) Retrieval: given a query point cloud (black point cloud), the text decoder performs a beam search to return the $n$ most probable solutions.
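
The retrieval step in panel (b) can be illustrated with a toy beam search over identifier tokens. This is only a sketch of the general technique: `beam_search` and `toy_decoder` are hypothetical names, the binary token vocabulary is an assumption, and the hand-written scoring function merely stands in for the log-probabilities that a trained text decoder would produce for a given query point cloud.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(next_log_probs: Callable[[str], Dict[str, float]],
                length: int, beam_width: int) -> List[Tuple[str, float]]:
    """Return the beam_width highest-scoring identifiers of a fixed length.

    next_log_probs(prefix) gives the log-probability of each next token.
    """
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + token, score + log_p))
        # Keep only the most probable partial identifiers.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_decoder(prefix: str) -> Dict[str, float]:
    """Stand-in for a trained decoder (assumption, not the paper's model)."""
    if prefix.endswith("1"):
        return {"0": math.log(0.7), "1": math.log(0.3)}
    return {"0": math.log(0.2), "1": math.log(0.8)}


results = beam_search(toy_decoder, length=3, beam_width=2)
for docid, score in results:
    print(docid, round(math.exp(score), 3))
```

Each returned string plays the role of a candidate docid, ranked by the probability the decoder assigns to it, so the cost of a query depends on the identifier length and the beam width but not on the number of indexed point clouds.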