HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation
Samir Abou Haidar, Alexandre Chariot, Mehdi Darouich, Cyril Joly, Jean-Emmanuel Deschaud
TL;DR
HARP-NeXt tackles the speed-accuracy tension in LiDAR semantic segmentation for resource-constrained platforms by integrating GPU-based pre-processing, a lightweight Conv-SE-NeXt feature extractor, and a multi-scale range-point fusion backbone that jointly leverages 2D range images and 3D points. It introduces efficient cross-domain feature mappings and a residual-attentive Fusion Head, guided by a multi-term loss that supervises both pixel- and point-level predictions. The approach achieves competitive mIoU on nuScenes and SemanticKITTI while running in ≈10 ms on an RTX 4090 and ≈71 ms on a Jetson AGX Orin, with a small parameter count (≈5.4M) and no test-time augmentation. This combination enables real-time, accurate LiDAR perception on embedded platforms, broadening applicability for autonomous vehicles and mobile robots.
Abstract
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse-convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows inference. Moreover, the pre-processing phase of all these methods adds to execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3 while running 24$\times$ faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt
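To make the range-point idea concrete, the following is a minimal NumPy sketch of the standard spherical projection that maps 3D points to range-image pixels, together with the point-to-pixel scatter and pixel-to-point gather that a fusion backbone relies on. It is not the authors' implementation (HARP-NeXt performs its pre-processing on the GPU); the function names, the image size, and the field-of-view values are illustrative assumptions.

```python
import numpy as np


def spherical_projection(points, height=32, width=1024,
                         fov_up_deg=10.0, fov_down_deg=-30.0):
    """Map (N, 3) points to integer (row, col) range-image coordinates.

    The image size and vertical field of view are assumed values for
    illustration, not settings taken from the paper.
    """
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    depth = np.maximum(np.linalg.norm(points, axis=1), 1e-8)
    yaw = np.arctan2(points[:, 1], points[:, 0])            # azimuth angle
    pitch = np.arcsin(np.clip(points[:, 2] / depth, -1.0, 1.0))  # elevation

    u = 0.5 * (1.0 - yaw / np.pi) * width                   # horizontal coordinate
    v = (1.0 - (pitch - fov_down) / fov) * height           # vertical coordinate

    cols = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, height - 1).astype(np.int64)
    return rows, cols, depth


def scatter_to_image(features, rows, cols, depth, height, width):
    """Point-to-pixel mapping: write (N, C) features into an (H, W, C) image.

    Points are written farthest-first, so when several points fall on the
    same pixel the nearest one is the value that remains.
    """
    image = np.zeros((height, width, features.shape[1]), dtype=features.dtype)
    order = np.argsort(depth)[::-1]
    image[rows[order], cols[order]] = features[order]
    return image


def gather_to_points(image, rows, cols):
    """Pixel-to-point mapping: read back an (N, C) feature per point."""
    return image[rows, cols]
```

Points that share a pixel receive identical features when gathered back, which illustrates the geometric information that a pure 2D projection loses and that point-level features are meant to preserve.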
