SALSA: Swift Adaptive Lightweight Self-Attention for Enhanced LiDAR Place Recognition

Raktim Gautam Goswami, Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami

TL;DR

LiDAR place recognition is critical for SLAM relocalization and loop closure but demands both accuracy and efficiency. SALSA delivers a lightweight solution by fusing a SphereFormer backbone with radial window attention, an adaptive attention pooling module, and an MLP-Mixer-based token aggregator, followed by PCA whitening; re-ranking with SpectralGV further improves geometric consistency. The approach achieves state-of-the-art retrieval and 6-DoF localization across six large-scale datasets while maintaining real-time performance and a small memory footprint, with extensive ablations showing the effectiveness of the adaptive pooling and radial attention. The work demonstrates SALSA’s robustness to rotation and occlusion and shows practical integration into LiDAR-SLAM pipelines for loop closure and pose-graph optimization.

Abstract

Large-scale LiDAR mapping and localization rely on place recognition to mitigate odometry drift and keep maps accurate. Place recognition methods use scene representations derived from LiDAR point clouds to identify previously visited sites within a database. Local descriptors, assigned to each point within a point cloud, are aggregated to form a scene representation for that point cloud. These local descriptors are also used to re-rank the retrieved point clouds based on geometric fitness scores. We propose SALSA, a novel, lightweight, and efficient framework for LiDAR place recognition. It consists of a SphereFormer backbone that uses radial window attention to enable information aggregation for sparse distant points, an adaptive self-attention layer to pool local descriptors into tokens, and a multi-layer-perceptron (MLP) Mixer layer that aggregates the tokens to generate a scene descriptor. The proposed framework outperforms existing methods on various LiDAR place recognition datasets in terms of both retrieval and metric localization while operating in real time.
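
To make the pooling step in the abstract concrete, here is a minimal sketch of attention-based pooling in plain Python. It is an illustration under stated assumptions, not the authors' module: the fixed set of query vectors, the scaled dot-product scoring, the sizes, and the function names (`attention_pool`, `l2_normalize`) are all hypothetical, and simple concatenation stands in for the MLP-Mixer aggregator. The property it shows is that a point cloud with any number of local descriptors is reduced to a fixed number of tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(local_descriptors, queries):
    """Pool N local descriptors (length d each) into len(queries) tokens.

    Each token is a softmax-weighted average of all local descriptors, with
    weights taken from scaled dot products between one query vector and
    every descriptor. The query vectors are an assumption of this sketch.
    """
    dim = len(local_descriptors[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for query in queries:
        scores = [scale * sum(q * x for q, x in zip(query, desc))
                  for desc in local_descriptors]
        weights = softmax(scores)
        token = [sum(w * desc[j] for w, desc in zip(weights, local_descriptors))
                 for j in range(dim)]
        tokens.append(token)
    return tokens


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


rng = random.Random(0)
num_points, dim, num_tokens = 200, 8, 4  # hypothetical sizes
local_descriptors = [[rng.gauss(0, 1) for _ in range(dim)]
                     for _ in range(num_points)]
# Random values stand in for weights that would be learned during training.
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

tokens = attention_pool(local_descriptors, queries)
# Concatenation is a crude stand-in for the Mixer-based token aggregator.
scene_descriptor = l2_normalize([v for token in tokens for v in token])
```

Because each token is a weighted average over all points, the result does not depend on the order of the points, which suits unordered point clouds.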

Paper Structure (27 sections, 4 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: SALSA framework to retrieve and re-rank nearest point clouds using scene and local descriptors, respectively.
  • Figure 2: Overview of our SALSA framework to generate scene descriptors from point clouds for place recognition. A SphereFormer backbone with radial and cubic window attention is employed to extract local descriptors from point clouds. These local descriptors are fused into tokens via a self-attention adaptive pooling module. Subsequently, the pooled tokens are processed by an MLP mixer-based aggregator to iteratively incorporate global context information. Finally, a PCA whitening layer reduces the dimension and decorrelates the aggregated descriptor, producing the global scene descriptor.
  • Figure 3: Our adaptive pooling module, where local descriptors are pooled into tokens using a self-attention mechanism.
  • Figure 4: Our token aggregator, which iteratively fuses token embeddings and pools them into a scene descriptor using channel-mixing and token-mixing MLPs.
  • Figure 5: Box plot displaying Recall@1 across six datasets, with first to third quartile spans, whiskers for data variability, and internal lines as medians.
  • ...and 4 more figures
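
The Recall@1 metric named in the Figure 5 caption can be sketched as follows. This assumes the common definition (the fraction of queries whose single nearest database descriptor belongs to a true revisit) and a hypothetical revisit radius. The function name `recall_at_1`, the toy descriptors, the positions, and the 5 m threshold are illustrative only and are not taken from the paper.

```python
import math


def recall_at_1(query_descs, query_positions, db_descs, db_positions, radius):
    """Fraction of queries whose nearest database descriptor is a true match.

    A retrieval counts as correct when the database entry with the smallest
    descriptor distance lies within `radius` of the query's true position.
    """
    hits = 0
    for q_desc, q_pos in zip(query_descs, query_positions):
        # Top-1 retrieval: index of the closest descriptor in the database.
        best = min(range(len(db_descs)),
                   key=lambda i: math.dist(q_desc, db_descs[i]))
        if math.dist(q_pos, db_positions[best]) <= radius:
            hits += 1
    return hits / len(query_descs)


# Toy data: 2-D descriptors and 2-D positions in metres (all hypothetical).
db_descs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
db_positions = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)]
query_descs = [[0.9, 0.1], [0.1, 0.9], [-0.8, 0.3]]
query_positions = [(2.0, 1.0), (48.0, 0.0), (10.0, 0.0)]

score = recall_at_1(query_descs, query_positions, db_descs, db_positions,
                    radius=5.0)
print(f"Recall@1 = {score:.3f}")
```

In this toy setup the third query retrieves a database entry far from its true position, so it counts as a miss. A re-ranking stage such as the one described in the TL;DR aims to demote false matches of this kind.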