HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation

Samir Abou Haidar, Alexandre Chariot, Mehdi Darouich, Cyril Joly, Jean-Emmanuel Deschaud

TL;DR

HARP-NeXt addresses the speed-accuracy trade-off in LiDAR semantic segmentation on resource-constrained platforms by combining GPU-based pre-processing, a lightweight Conv-SE-NeXt feature extractor, and a multi-scale range-point fusion backbone that jointly uses 2D range images and 3D points. It introduces efficient cross-domain feature mappings and a residual-attentive Fusion Head, trained with a multi-term loss that supervises both pixel- and point-level predictions. The approach achieves competitive mIoU on nuScenes and SemanticKITTI with runtimes of ≈10 ms on an RTX 4090 and ≈71 ms on a Jetson AGX Orin, a small parameter count (~5.4M), and no test-time augmentation. This combination enables real-time, high-precision LiDAR perception on embedded platforms, broadening applicability for autonomous vehicles and mobile robots.

Abstract

LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24$\times$ faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt

Paper Structure

This paper contains 11 sections, 19 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: mIoU vs. runtime on nuScenes validation set. HARP-NeXt achieves high accuracy in near real-time with fast execution, breaking performance trends when deployed on the Jetson AGX Orin.
  • Figure 2: Pre-processing workflows. Dashed blocks indicate on-GPU processes; arrow thickness reflects data bandwidth.
  • Figure 3: Design structure of the ResNet (He et al., 2016), ConvNeXt (Liu et al., 2022), SE-ResNet (Hu et al., 2018), and Conv-SE-NeXt blocks.
  • Figure 4: The HARP-NeXt architecture consists of: 1) a Features Encoder that embeds initial 2D and 3D features, 2) a Backbone that employs a single Conv-SE-NeXt feature extractor per network stage and fuses pixel and point features at multiple scales via efficient mappings $\mathcal{P}_{t}2\mathcal{P}_{x} : \mathcal{P}_{x} = \mathcal{M}(\mathcal{P}_{t})$ and $\mathcal{P}_{x}2\mathcal{P}_{t} : \mathcal{P}_{t} = \mathcal{M}^{-1}(\mathcal{P}_{x})$ to hierarchically refine them, and 3) a Fusion Head that combines different context-aware information from the pixel and point levels to predict a label for each 3D point.
  • Figure 5: HARP-NeXt's results compared to the ground truth.
  • ...and 2 more figures
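
The point-to-pixel and pixel-to-point mappings named in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the spherical range-image projection commonly used by projection-based methods, mean aggregation when several points land in one pixel, and illustrative sensor values (a 32 x 64 image, a vertical field of view from -30° to +10°). The function names `project_points`, `points_to_pixels`, and `pixels_to_points` are hypothetical.

```python
import math

def project_points(points, height, width, fov_up_deg, fov_down_deg):
    """Assign each 3D point (x, y, z) a (row, col) pixel in an H x W range image."""
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    pixels = []
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)       # range; assumed non-zero
        yaw = math.atan2(y, x)                     # horizontal angle
        pitch = math.asin(z / r)                   # vertical angle
        u = 0.5 * (1.0 - yaw / math.pi) * width    # continuous column
        v = (1.0 - (pitch - fov_down) / fov) * height  # continuous row
        col = min(width - 1, max(0, math.floor(u)))    # clamp to image bounds
        row = min(height - 1, max(0, math.floor(v)))
        pixels.append((row, col))
    return pixels

def points_to_pixels(point_feats, pixels):
    """Point-to-pixel sketch: average the features of points sharing a pixel."""
    sums, counts = {}, {}
    for feat, px in zip(point_feats, pixels):
        acc = sums.setdefault(px, [0.0] * len(feat))
        sums[px] = [a + f for a, f in zip(acc, feat)]
        counts[px] = counts.get(px, 0) + 1
    return {px: [s / counts[px] for s in sums[px]] for px in sums}

def pixels_to_points(pixel_feats, pixels):
    """Pixel-to-point sketch: each point reads back the feature of its pixel."""
    return [pixel_feats[px] for px in pixels]

# Two points along the same ray (different ranges) and one point elsewhere.
points = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 10.0, 1.0)]
point_feats = [[1.0], [3.0], [5.0]]

pixels = project_points(points, height=32, width=64,
                        fov_up_deg=10.0, fov_down_deg=-30.0)
pixel_feats = points_to_pixels(point_feats, pixels)
back = pixels_to_points(pixel_feats, pixels)
print(pixels, back)
```

In this toy input the first two points share a pixel, so after the round trip both carry the averaged feature and can no longer be told apart from pixel features alone. This is the kind of geometric information loss the abstract attributes to 2D projection, and it suggests why a fusion design would keep per-point features alongside the pixel features.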