DAGLFNet: Deep Feature Attention Guided Global and Local Feature Fusion for Pseudo-Image Point Cloud Segmentation

Chuang Chen, Yi Lin, Bo Wang, Jing Hu, Xi Wu, Wenyi Ge

TL;DR

DAGLFNet targets efficient yet discriminative LiDAR semantic segmentation using a pseudo-image representation augmented with three core components: Global-Local Feature Fusion Encoding (GL-FFE) to stabilize local geometry and capture global context, Multi-Branch Feature Extraction (MB-FE) to expand receptive fields and sharpen boundaries, and Feature Fusion via Deep Feature-guided Attention (FFDFA) to refine cross-channel fusion. The framework jointly learns point-level and group-level features through a deep-feature-guided attention mechanism and a fusion head that aligns multi-scale information, reaching 69.9% mIoU on SemanticKITTI and 78.7% mIoU on nuScenes (both with augmentation) while maintaining competitive efficiency. The work demonstrates that integrating global-local context, boundary-enhanced multi-branch features, and deep-feature-guided fusion in pseudo-image pipelines yields robust segmentation for long-range, sparse, and occluded LiDAR scenes, with practical implications for real-time autonomous navigation. Future work could further improve robustness in extremely sparse or occluded regions by refining geometric-semantic representations and exploring adaptive fusion at finer spatial scales.

Abstract

Environmental perception systems are crucial for high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor providing accurate 3D point cloud data. Efficiently processing unstructured point clouds while extracting structured semantic information remains a significant challenge. In recent years, numerous pseudo-image-based representation methods have emerged to balance efficiency and performance by fusing 3D point clouds with 2D grids. However, the fundamental inconsistency between the pseudo-image representation and the original 3D information critically undermines 2D-3D feature fusion, posing a primary obstacle for coherent information fusion and leading to poor feature discriminability. This work proposes DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. It incorporates three key components: first, a Global-Local Feature Fusion Encoding (GL-FFE) module to enhance intra-set local feature correlation and capture global contextual information; second, a Multi-Branch Feature Extraction (MB-FE) network to capture richer neighborhood information and improve the discriminability of contour features; and third, a Feature Fusion via Deep Feature-guided Attention (FFDFA) mechanism to refine cross-channel feature fusion precision. Experimental evaluations demonstrate that DAGLFNet achieves mean Intersection-over-Union (mIoU) scores of 69.9% and 78.7% on the validation sets of SemanticKITTI and nuScenes, respectively. The method achieves an excellent balance between accuracy and efficiency.
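The mIoU figures quoted above follow the standard per-class definition, IoU = TP / (TP + FP + FN), averaged over classes. The snippet below is a minimal sketch of that metric, not the authors' evaluation code. The function name `mean_iou`, the `ignore_index` convention, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. Benchmark tooling typically accumulates counts over the whole validation set and averages over a fixed class list.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = -1) -> float:
    """Mean per-class IoU for integer point labels (illustrative helper).

    Assumes every prediction lies in [0, num_classes). Points whose
    ground-truth label equals ignore_index are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it does not belong
            fn[t] += 1  # missed a point of class t

    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # assumption: skip classes absent from both inputs
            ious.append(tp[c] / denom)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU: 1/3, 2/3, 1/2
    print(mean_iou(pred, target, num_classes=3))
```

On the toy labels in the usage example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5 (up to floating-point rounding).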

Paper Structure

This paper contains 19 sections, 15 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: (a) Visualization of feature ambiguity and boundary blurring through average pooling of encoded feature channels from the LiDAR point cloud representation, and (b) corresponding semantic segmentation result demonstrating the classification performance.
  • Figure 2: The proposed DAGLFNet framework consists of key components such as GL-FFE, MB-FE, FFDFA, and the Fusion Head, which are responsible for contextual and geometric feature extraction, boundary enhancement, multi-scale feature integration, and final prediction respectively. Multiple stacked DAGLFNet units continuously learn complex hierarchical features from the point cloud, with the Fusion Head combining point-level and group-level features to predict the final output.
  • Figure 3: Class-wise LiDAR segmentation results of DAGLFNet and the baseline model on the val set of SemanticKITTI (Behley et al., 2019).
  • Figure 4: Comparison of mIoU (%) between DAGLFNet and the baseline method across different distance ranges.
  • Figure 5: mIoU vs. inference speed for various point cloud semantic segmentation methods on the SemanticKITTI (Behley et al., 2019) validation set. Marker size indicates model size. DAGLFNet achieves the best balance of accuracy and speed.
  • ...and 4 more figures