PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation

Shoumeng Qiu, Xinrun Li, XiangYang Xue, Jian Pu

TL;DR

This work tackles the inefficiency of cross-view fusion for LiDAR semantic segmentation by proposing PC-BEV, a Polar-Cartesian BEV fusion framework that operates entirely in BEV space using fixed correspondences between polar and Cartesian partitions. A remap-based fusion method enables dense, memory-efficient feature interaction, yielding up to a 170× speedup over point-based methods while preserving contextual information. A Transformer-CNN Mixture Architecture provides global scene understanding plus local refinement for BEV features, delivering strong accuracy with real-time inference on SemanticKITTI and nuScenes. The combination of BEV-only fusion and efficient remapping demonstrates that multiview fusion can be realized without expensive point-wise interactions, offering practical benefits for autonomous driving systems. Code is available at the provided GitHub URL.

Abstract

Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive point-based interactions, arising from the lack of fixed correspondences between views such as range view and Bird's-Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV-only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (170$\times$ speedup) than conventional point-based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives. To enhance scene understanding while maintaining inference efficiency, we also introduce a hybrid Transformer-CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV-based fusion for LiDAR segmentation. Code is available at \url{https://github.com/skyshoumeng/PC-BEV}.
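
To make the idea of fixed grid correspondences concrete, the following is a minimal sketch of a polar-to-Cartesian remap, not the authors' implementation. It assumes a square Cartesian grid centered on the sensor, a polar grid indexed as `[rho][phi]`, and nearest-cell lookup; the helper names `build_p2c_index` and `remap_polar_to_cart`, the grid sizes, and the single-channel features are all hypothetical choices made for illustration. The point it illustrates is that the lookup table depends only on the two grid layouts, so it can be computed once and reused for every scan.

```python
import math


def build_p2c_index(cart_size, max_range, n_rho, n_phi):
    """Precompute, for every Cartesian cell, the polar cell containing its center.

    Returns a cart_size x cart_size table of (rho_idx, phi_idx) tuples,
    or None where the cell center lies outside the polar grid's radius.
    Assumption: nearest-cell lookup; other interpolation schemes are possible.
    """
    cell = 2.0 * max_range / cart_size
    table = []
    for iy in range(cart_size):
        row = []
        for ix in range(cart_size):
            # Center of the Cartesian cell in metric coordinates.
            x = -max_range + (ix + 0.5) * cell
            y = -max_range + (iy + 0.5) * cell
            rho = math.hypot(x, y)
            if rho >= max_range:
                row.append(None)
                continue
            phi = math.atan2(y, x) % (2.0 * math.pi)
            rho_idx = int(rho / max_range * n_rho)
            # Clamp guards against phi rounding up to exactly 2*pi.
            phi_idx = min(int(phi / (2.0 * math.pi) * n_phi), n_phi - 1)
            row.append((rho_idx, phi_idx))
        table.append(row)
    return table


def remap_polar_to_cart(polar_feat, table, fill=0.0):
    """Gather polar features into the Cartesian layout using the fixed table."""
    return [
        [polar_feat[c[0]][c[1]] if c is not None else fill for c in row]
        for row in table
    ]


# Hypothetical toy sizes: 4x4 Cartesian grid, 2 radial bins, 4 angular bins.
table = build_p2c_index(cart_size=4, max_range=2.0, n_rho=2, n_phi=4)
polar_feat = [[10, 11, 12, 13], [20, 21, 22, 23]]
cart_feat = remap_polar_to_cart(polar_feat, table)
```

Because every Cartesian cell inside the polar radius receives a value, the result is dense regardless of where LiDAR points fall, which is the property the abstract contrasts with sparse point-based fusion.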

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison with other projection-based methods. The results demonstrate the advantages of our method in terms of both performance and speed. Experiments are conducted on the nuScenes validation set.
  • Figure 2: The pipeline of our proposed Polar-Cartesian BEV fusion framework for 3D point cloud semantic segmentation. A point cloud scan is first projected to polar and Cartesian BEV pseudo-images, which serve as inputs to the Transformer-CNN Mixture architecture feature extraction network. The features of the two branches then interact bidirectionally through the proposed PolarToCart (P2C) and CartToPolar (C2P) modules. Finally, a grid sampling operation obtains point-wise features from the concatenated features, and the sampled features are fed into a simple MLP block to produce the final semantic predictions.
  • Figure 3: Comparison of the feature interaction processes of the previous point-based method and our proposed remap-based method across different settings. $\{\mathrm{T}_1, \mathrm{T}_2, \dots, \mathrm{T}_n\}$ denotes the CUDA kernel processing at different time steps and the corresponding cache states. The point-based method uses points as a bridge to facilitate feature interaction across different perspectives or spatial partitioning strategies, while our remap-based method relies on fixed correspondences within the same BEV space. In the point-based method, each point is treated individually for point-level parallelism, and the fused features are ultimately scattered back to the original feature space. Our method instead takes advantage of spatial continuity during the remap process to reduce cache misses, enabling more efficient parallel processing and eliminating the need for scatter-back operations, which significantly improves computational efficiency.
  • Figure 4: Comparison between point-based interaction results and our proposed remap-based interaction results; $\bullet$ denotes the LiDAR points. The point-based method only fuses features where points exist, resulting in sparse fusion, while our method performs fusion across the entire space, resulting in dense fusion that incorporates more information.
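
The final step described in the Figure 2 caption, grid sampling of point-wise features from a BEV feature map, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: it uses bilinear interpolation with border clamping on a single-channel Cartesian map, and the function names `bilinear_sample` and `sample_point_features`, the `max_range` parameter, and the toy values are hypothetical.

```python
import math


def bilinear_sample(feat, u, v):
    """Bilinearly sample feat[row][col] at continuous pixel coords (u=col, v=row).

    Coordinates are in cell-center units, so (0, 0) is the center of feat[0][0].
    Out-of-range coordinates are clamped to the border (an assumption).
    """
    h, w = len(feat), len(feat[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = feat[y0][x0] * (1 - fx) + feat[y0][x1] * fx
    bottom = feat[y1][x0] * (1 - fx) + feat[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_point_features(feat, points_xy, max_range):
    """Look up one feature value per LiDAR point from a Cartesian BEV map.

    The map is assumed to cover [-max_range, max_range] in both x and y.
    """
    h, w = len(feat), len(feat[0])
    sampled = []
    for x, y in points_xy:
        # Convert metric coordinates to cell-center pixel coordinates.
        u = (x + max_range) / (2.0 * max_range) * w - 0.5
        v = (y + max_range) / (2.0 * max_range) * h - 0.5
        sampled.append(bilinear_sample(feat, u, v))
    return sampled


# Hypothetical 2x2 map covering [-1, 1] x [-1, 1], with four query points.
bev = [[0.0, 1.0], [2.0, 3.0]]
points = [(-0.5, -0.5), (0.0, 0.0), (0.5, -0.5), (0.5, 0.5)]
point_feats = sample_point_features(bev, points, max_range=1.0)
```

In a multi-channel setting the same interpolation weights would be applied to every channel, and the per-point vectors would then be passed to the MLP block mentioned in the caption.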