TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR Object Detection with Unsupervised Pre-Training

Li Li, Tanqiu Qiao, Hubert P. H. Shum, Toby P. Breckon

TL;DR

This work introduces Transformation-Invariant Local (TraIL) features and the associated TraIL-Det architecture, and proposes a Multi-head self-Attention Encoder with asymmetric geometric features to encode high-dimensional TraIL features into manageable representations.

Abstract

3D point clouds are essential for perceiving outdoor scenes, especially within the realm of autonomous driving. Recent advances in 3D LiDAR Object Detection focus primarily on the spatial positioning and distribution of points to ensure accurate detection. However, despite their robust performance in variable conditions, these methods are hindered by their sole reliance on coordinates and point intensity, resulting in inadequate isometric invariance and suboptimal detection outcomes. To tackle this challenge, our work introduces Transformation-Invariant Local (TraIL) features and the associated TraIL-Det architecture. Our TraIL features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize the inherent isotropic radiation of LiDAR to enhance local representation, improve computational efficiency, and boost detection performance. To effectively process the geometric relations among points within each proposal, we propose a Multi-head self-Attention Encoder (MAE) with asymmetric geometric features to encode high-dimensional TraIL features into manageable representations. Our method outperforms contemporary self-supervised 3D object detection approaches in terms of mAP on KITTI (67.8, 20% label, moderate) and Waymo (68.9, 20% label, moderate) datasets under various label ratios (20%, 50%, and 100%).
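
The rigid transformation invariance claimed above can be illustrated with a toy check. The descriptor below (sorted distances from a point to its k nearest neighbors) is a hypothetical stand-in chosen for illustration, not the paper's TraIL feature definition; the function names and the yaw-plus-translation transform are likewise assumptions. The sketch only shows why distance-based local features stay fixed under a rigid motion while raw coordinates do not.

```python
import math
import random


def rigid_transform(points, theta, t):
    """Rotate points about the z-axis by theta (yaw), then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1], z + t[2])
            for x, y, z in points]


def local_descriptor(points, center_idx, k):
    """Hypothetical local feature: sorted distances to the k nearest neighbors.

    Distances between points are preserved by rotations and translations,
    so this descriptor is invariant to rigid transformations.
    """
    center = points[center_idx]
    dists = sorted(math.dist(center, p)
                   for i, p in enumerate(points) if i != center_idx)
    return dists[:k]


random.seed(0)
cloud = [(random.uniform(-5, 5), random.uniform(-5, 5), random.uniform(-1, 1))
         for _ in range(32)]
moved = rigid_transform(cloud, theta=0.7, t=(3.0, -2.0, 0.5))

before = local_descriptor(cloud, center_idx=0, k=8)
after = local_descriptor(moved, center_idx=0, k=8)

# Largest change in the descriptor (floating-point noise only).
max_diff = max(abs(a - b) for a, b in zip(before, after))
# Largest change in the raw coordinates of the same point (clearly non-zero).
raw_diff = max(abs(a - b) for a, b in zip(cloud[0], moved[0]))

print(f"descriptor change: {max_diff:.2e}, raw coordinate change: {raw_diff:.2f}")
```

A network fed only raw coordinates and intensity has to learn this invariance from data, whereas a distance-based local feature has it by construction.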

Paper Structure (14 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Our proposed TraIL architecture for 3D object detection leverages TraIL features from the point cloud. ➊ We take point clouds as input and augment them into different views. ➋ The augmented point clouds are sampled to form the initial paired region proposals. ➌ The encoding module (TraIL MAE) extracts expressive proposal representations by considering the geometric relations among points within each proposal. ➍ We extract the concatenated features with the Multi-Head Attention Encoding Module (TraIL MAE). ➎ Inter-Proposal Discrimination (IPD) and Inter-Cluster Separation (ICS), together forming the D&S module [yin2022proposalcontrast], are subsequently enforced to optimize the whole network.
  • Figure 2: Multi-attention geometric encoding. The asymmetric geometric features are computed from the proposal $P^{*}$, specifically the center and neighbor points, through a subtraction operator. The geometric features are further refined by a proposal-aware encoding module that utilizes a multi-head self-attention mechanism.
  • Figure 3: The qualitative results of 3D object detection with our TraIL-Det on the KITTI dataset. The predicted 3D bounding boxes are marked within the point cloud frame, while the corresponding 2D bounding boxes are highlighted in the RGB images. In the point cloud visualization, white points represent those within the camera field of view (FOV), whereas purple points indicate those outside the camera FOV. Best viewed in color.
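
The encoding step described in the Figure 2 caption (subtracting the center from its neighbor points, then refining with multi-head self-attention) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the random projection weights, the head count, and the feature sizes are all hypothetical, and the paper's actual geometric features are richer than plain offsets.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_offsets(center, neighbors):
    """Asymmetric geometric feature (assumed form): neighbor minus center."""
    return [[n[i] - center[i] for i in range(3)] for n in neighbors]


def self_attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention over the points of one proposal."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights


def multi_head_encode(x, heads):
    """Run every head and concatenate the per-point outputs."""
    outs = [self_attention_head(x, *h)[0] for h in heads]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]


rng = random.Random(0)


def rand_matrix(rows, cols):
    # Random weights stand in for learned projections (illustration only).
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Two heads, each projecting 3-D offsets to 4-D queries, keys, and values.
heads = [tuple(rand_matrix(3, 4) for _ in range(3)) for _ in range(2)]

center = (1.0, 2.0, 0.5)
neighbors = [(1.5, 2.0, 0.5), (1.0, 3.0, 0.25), (0.0, 2.5, 1.0), (2.0, 1.0, 0.5)]

feats = relative_offsets(center, neighbors)
encoded = multi_head_encode(feats, heads)
_, attn = self_attention_head(feats, *heads[0])

# Shifting the whole proposal leaves the offsets, and so the encoding, unchanged.
shift = (10.0, -4.0, 2.0)
shifted_center = tuple(c + s for c, s in zip(center, shift))
shifted_neighbors = [tuple(c + s for c, s in zip(n, shift)) for n in neighbors]
encoded_shifted = multi_head_encode(
    relative_offsets(shifted_center, shifted_neighbors), heads)
max_shift_diff = max(abs(a - b)
                     for ra, rb in zip(encoded, encoded_shifted)
                     for a, b in zip(ra, rb))

print(len(encoded), len(encoded[0]), max_shift_diff)
```

Because the subtraction removes the absolute position of the proposal, the attention layers see only relative geometry, which is why translating the proposal does not change the encoded representation in this sketch.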