SALSA: Swift Adaptive Lightweight Self-Attention for Enhanced LiDAR Place Recognition
Raktim Gautam Goswami, Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami
TL;DR
LiDAR place recognition is critical for SLAM relocalization and loop closure but demands both accuracy and efficiency. SALSA delivers a lightweight solution by fusing a SphereFormer backbone with radial window attention, an adaptive attention pooling module, and an MLP-Mixer-based token aggregator, followed by PCA whitening; re-ranking with SpectralGV further improves geometric consistency. The approach achieves state-of-the-art retrieval and 6-DoF localization across six large-scale datasets while maintaining real-time performance and a small memory footprint, with extensive ablations showing the effectiveness of the adaptive pooling and radial attention. The work demonstrates SALSA’s robustness to rotation and occlusion and shows practical integration into LiDAR-SLAM pipelines for loop closure and pose-graph optimization.
Abstract
Large-scale LiDAR mapping and localization rely on place recognition to mitigate odometry drift and keep maps accurate. Place recognition methods build scene representations from LiDAR point clouds and use them to identify previously visited sites in a database. Local descriptors, one per point in a point cloud, are aggregated into a scene representation of that point cloud. The same local descriptors are also used to re-rank the retrieved point clouds based on geometric fitness scores. We propose SALSA, a novel, lightweight, and efficient framework for LiDAR place recognition. It consists of a SphereFormer backbone that uses radial window attention to aggregate information for sparse, distant points, an adaptive self-attention layer that pools local descriptors into tokens, and a multi-layer-perceptron Mixer layer that aggregates the tokens into a scene descriptor. The proposed framework outperforms existing methods on various LiDAR place recognition datasets in both retrieval and metric localization while operating in real time.
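The pooling and aggregation stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the use of learned query vectors for pooling, the single simplified Mixer block without layer normalization, and all names and sizes (`init_params`, `attention_pool`, `mixer_block`, `scene_descriptor`, `d`, `k`, `hidden`) are assumptions chosen for illustration. Random weights stand in for trained ones, and the backbone is replaced by random local descriptors.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d, k, hidden):
    # Hypothetical parameter set; in a trained model these would be learned.
    return {
        "queries": rng.normal(size=(k, d)),            # one query per output token
        "tok_w1": rng.normal(size=(hidden, k)) * 0.1,  # token-mixing MLP
        "tok_w2": rng.normal(size=(k, hidden)) * 0.1,
        "ch_w1": rng.normal(size=(d, hidden)) * 0.1,   # channel-mixing MLP
        "ch_w2": rng.normal(size=(hidden, d)) * 0.1,
    }

def attention_pool(local_feats, queries):
    # Pool N local descriptors (N, d) into k tokens (k, d).
    # Each token is a softmax-weighted average over all points.
    d = local_feats.shape[1]
    scores = queries @ local_feats.T / np.sqrt(d)  # (k, N)
    weights = softmax(scores, axis=1)              # rows sum to 1
    return weights @ local_feats                   # (k, d)

def mixer_block(tokens, p):
    # Simplified Mixer block (assumption: ReLU, no layer norm).
    # Token mixing: exchange information across the k tokens.
    tokens = tokens + p["tok_w2"] @ np.maximum(p["tok_w1"] @ tokens, 0.0)
    # Channel mixing: exchange information across the d channels.
    tokens = tokens + np.maximum(tokens @ p["ch_w1"], 0.0) @ p["ch_w2"]
    return tokens

def scene_descriptor(local_feats, p):
    # Local descriptors -> tokens -> mixed tokens -> flat unit-norm vector.
    tokens = attention_pool(local_feats, p["queries"])
    mixed = mixer_block(tokens, p)
    g = mixed.reshape(-1)
    return g / (np.linalg.norm(g) + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, d=16, k=4, hidden=8)
    feats = rng.normal(size=(100, 16))  # stand-in for backbone output
    print(scene_descriptor(feats, params).shape)
```

Because the pooling weights are normalized over points, the descriptor has a fixed length (`k * d`) regardless of how many points the scan contains, and it does not depend on the order of the points. Both properties are convenient for database retrieval by nearest-neighbor search.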
