DAGLFNet: Deep Feature Attention Guided Global and Local Feature Fusion for Pseudo-Image Point Cloud Segmentation
Chuang Chen, Yi Lin, Bo Wang, Jing Hu, Xi Wu, Wenyi Ge
TL;DR
DAGLFNet targets efficient yet discriminative LiDAR semantic segmentation by augmenting a pseudo-image representation with three core components: Global-Local Feature Fusion Encoding (GL-FFE) to stabilize local geometry and capture global context, Multi-Branch Feature Extraction (MB-FE) to expand receptive fields and sharpen boundaries, and Feature Fusion via Deep Feature-guided Attention (FFDFA) to refine cross-channel feature fusion. The framework jointly learns point-level and group-level features through a deep-feature-guided attention mechanism and a fusion head that aligns multi-scale information, reaching an mIoU of 69.9% on SemanticKITTI and 78.7% on nuScenes (both with augmentation) while maintaining competitive efficiency. The work demonstrates that integrating global-local context, boundary-enhanced multi-branch features, and deep-feature-guided fusion in pseudo-image pipelines yields robust segmentation for long-range, sparse, and occluded LiDAR scenes, with practical implications for real-time autonomous navigation. Future work could further improve robustness in extremely sparse or occluded regions by refining geometric-semantic representations and exploring adaptive fusion at finer spatial scales.
Abstract
Environmental perception systems are crucial for high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor providing accurate 3D point cloud data. Efficiently processing unstructured point clouds while extracting structured semantic information remains a significant challenge. In recent years, numerous pseudo-image-based representation methods have emerged to balance efficiency and performance by fusing 3D point clouds with 2D grids. However, the fundamental inconsistency between the pseudo-image representation and the original 3D information undermines 2D-3D feature fusion and leads to poor feature discriminability. This work proposes DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. It incorporates three key components: first, a Global-Local Feature Fusion Encoding (GL-FFE) module to enhance intra-set local feature correlation and capture global contextual information; second, a Multi-Branch Feature Extraction (MB-FE) network to capture richer neighborhood information and improve the discriminability of contour features; and third, a Feature Fusion via Deep Feature-guided Attention (FFDFA) mechanism to refine cross-channel feature fusion precision. Experimental evaluations demonstrate that DAGLFNet achieves mean Intersection-over-Union (mIoU) scores of 69.9% and 78.7% on the validation sets of SemanticKITTI and nuScenes, respectively. The method achieves an excellent balance between accuracy and efficiency.
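
For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean Intersection-over-Union is commonly computed from per-point labels. It is not the authors' evaluation code: the function name `mean_iou`, its parameters, and the toy labels are illustrative assumptions, and benchmark toolkits typically also exclude an "unlabeled" class, which is omitted here for brevity.

```python
from collections import Counter


def mean_iou(y_true, y_pred, num_classes):
    """Return (mIoU, per-class IoU list) for two equal-length label sequences.

    Hypothetical helper for illustration; classes are integers in
    [0, num_classes).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    intersection = Counter()  # points where prediction matches ground truth
    true_count = Counter()    # ground-truth points per class
    pred_count = Counter()    # predicted points per class

    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            intersection[t] += 1

    ious = []
    for c in range(num_classes):
        # |A ∪ B| = |A| + |B| - |A ∩ B|
        union = true_count[c] + pred_count[c] - intersection[c]
        # A class absent from both prediction and ground truth has no
        # defined IoU, so it is recorded as None and skipped in the mean.
        ious.append(intersection[c] / union if union > 0 else None)

    valid = [v for v in ious if v is not None]
    miou = sum(valid) / len(valid) if valid else 0.0
    return miou, ious


# Toy example with three classes and six points (made-up labels).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
miou, per_class = mean_iou(labels, preds, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Because each class contributes equally to the average regardless of how many points it has, mIoU penalizes errors on rare classes more heavily than overall point accuracy does.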
