That's My Point: Compact Object-centric LiDAR Pose Estimation for Large-scale Outdoor Localisation

Georgi Pramatarov, Matthew Gadd, Paul Newman, Daniele De Martini

TL;DR

The paper introduces an object-matching network based on self- and cross-correlation that captures geometric and semantic relationships between entities. It achieves metric estimates comparable with state-of-the-art methods using almost half the representation size, 1.33 kB on average.

Abstract

This paper is about 3D pose estimation on LiDAR scans with minimal storage requirements, to enable scalable mapping and localisation. We achieve this by clustering all points of segmented scans into semantic objects and representing each object only by its centroid and semantic class. In this way, each LiDAR scan is reduced to a compact collection of four-number vectors. This abstracts away structural information from the scenes that is crucial for traditional registration approaches. To mitigate this, we introduce an object-matching network based on self- and cross-correlation that captures geometric and semantic relationships between entities. The resulting matches allow us to recover the relative transformation between scans through weighted Singular Value Decomposition (SVD) and RANdom SAmple Consensus (RANSAC). We demonstrate that such a representation is sufficient for metric localisation by registering point clouds taken from different viewpoints on the KITTI dataset, and at different periods of time by localising between KITTI and KITTI-360. We achieve accurate metric estimates comparable with state-of-the-art methods with almost half the representation size, specifically 1.33 kB on average.
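
The abstract states that the relative transformation is recovered from object matches through weighted SVD. A minimal sketch of that step, written as the standard weighted Kabsch solution, is given below. It is not the authors' implementation: the function name `weighted_svd_transform`, the synthetic centroids, and the weights are assumptions made for illustration, and the RANSAC stage is omitted.

```python
import numpy as np

def weighted_svd_transform(src, dst, weights):
    """Return R (3x3) and t (3,) minimising sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of matched object centroids (hypothetical inputs).
    weights:  (N,) non-negative match confidences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalise so the weights sum to one

    src_mean = w @ src                   # weighted centroids of both sets
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean

    # Weighted cross-covariance: sum_i w_i * src_c_i * dst_c_i^T
    H = (src_c * w[:, None]).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Synthetic check: rotate and translate random centroids, then recover the pose.
rng = np.random.default_rng(0)
src = rng.uniform(-20.0, 20.0, size=(12, 3))
yaw = np.deg2rad(30.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0,          0.0,         1.0]])
t_true = np.array([1.5, -0.7, 0.2])
dst = src @ R_true.T + t_true
weights = rng.uniform(0.1, 1.0, size=12)   # stand-in for match confidences

R_est, t_est = weighted_svd_transform(src, dst, weights)
```

With noise-free correspondences the estimate matches the ground-truth pose up to floating-point error. In practice some matches will be wrong, which is presumably why the abstract pairs the SVD solution with RANSAC.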

Paper Structure (23 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Method overview. LiDAR scans are represented extremely compactly by only the centroid and semantic class of the corresponding objects in the scene. Sacrificing information in this way, we learn a robust matching function which leverages the remaining geometry in the object scene structure as well as the semantic relationships between entities.
  • Figure 2: System diagram. Our method aims to register a semantically labelled query point cloud $P_s$ with a map. It clusters $P_s$ into an object set $O_s$, keeping only instance centroids and semantic labels. Then, the query and map sets are passed through a semantic embedding and feature extraction module. The resulting object features $F_s$ and $F_m$ are then passed through a geometric self- and feature-based cross-attention matching module, producing a cross-correlation similarity matrix $\bar{S}$. A semantic mask is then applied to filter erroneous matches, resulting in the final object correspondences.
  • Figure 3: KITTI-360 sequence 09, registered on KITTI sequence 07.
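
Two steps from the Figure 2 caption can be sketched in a few lines: reducing a labelled point cloud to an object set of centroids and classes, and applying a semantic mask to a similarity matrix. The sketch below rests on several assumptions. Per-point instance ids are taken as given (the clustering method is not described here), the similarity matrix is a hand-written stand-in for the network output $\bar{S}$, matches are picked by a simple row-wise argmax, and all function names are hypothetical.

```python
import numpy as np

def to_object_set(points, labels, instance_ids):
    """Collapse each instance to a four-number vector [cx, cy, cz, class]."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    instance_ids = np.asarray(instance_ids)
    objects = []
    for inst in np.unique(instance_ids):
        member = instance_ids == inst
        centroid = points[member].mean(axis=0)
        # Majority vote in case the points of one instance carry mixed labels
        classes, counts = np.unique(labels[member], return_counts=True)
        objects.append([*centroid, classes[np.argmax(counts)]])
    return np.array(objects, dtype=np.float32)

def masked_matches(similarity, query_classes, map_classes):
    """Keep, for each query object, the best-scoring map object of the same class."""
    same_class = query_classes[:, None] == map_classes[None, :]
    masked = np.where(same_class, similarity, -np.inf)   # semantic mask
    best = masked.argmax(axis=1)
    has_candidate = same_class.any(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if has_candidate[i]]

# Toy scan: two instances with two points each.
points = [[0, 0, 0], [2, 0, 0], [10, 10, 0], [10, 12, 0]]
labels = [1, 1, 2, 2]
instance_ids = [0, 0, 1, 1]
objects = to_object_set(points, labels, instance_ids)

# Stand-in similarity between 2 query objects and 2 map objects.
similarity = np.array([[0.9, 0.2],
                       [0.8, 0.1]])
query_classes = objects[:, 3]
map_classes = np.array([2.0, 1.0])
matches = masked_matches(similarity, query_classes, map_classes)
```

In the toy example the highest raw score for the first query object (0.9) points to a map object of a different class, so the mask discards it and the same-class candidate is kept instead. Stored as 32-bit floats, which is an assumption here, each object occupies 16 bytes, so `objects.nbytes` is 32 for this two-object scan.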