PolarBEVDet: Exploring Polar Representation for Multi-View 3D Object Detection in Bird's-Eye-View

Zichen Yu, Quanli Liu, Wei Wang, Liyong Zhang, Xiaoguang Zhao

TL;DR

PolarBEVDet introduces polar BEV representation to multi-view 3D object detection to address non-uniform image information distribution and view symmetry loss in Cartesian BEV. It combines a polar view transformer, polar temporal fusion, and a polar detection head, augmented with 2D auxiliary supervision and a spatial attention enhancement module. On nuScenes, PolarBEVDet achieves state-of-the-art results and demonstrates improved near-field perception and azimuth robustness, with good generalization across backbones and baselines. The work validates polar BEV as a viable alternative to Cartesian BEV in LSS-based camera BEV pipelines, offering both accuracy and efficiency gains for multi-view perception.

Abstract

Recently, LSS-based multi-view 3D object detection has provided an economical and deployment-friendly solution for autonomous driving. However, all existing LSS-based methods transform multi-view image features into a Cartesian Bird's-Eye-View (BEV) representation, which does not take into account the non-uniform image information distribution and hardly exploits the view symmetry. In this paper, in order to adapt to the image information distribution and preserve the view symmetry with regular convolution, we propose to employ the polar BEV representation in place of the Cartesian BEV representation. To achieve this, we tailor three modules: a polar view transformer to generate the polar BEV representation, a polar temporal fusion module for fusing historical polar BEV features, and a polar detection head to predict the polar-parameterized representation of the object. In addition, we design a 2D auxiliary detection head and a spatial attention enhancement module to improve the quality of feature extraction in perspective view and BEV, respectively. Finally, we integrate the above improvements into a novel multi-view 3D object detector, PolarBEVDet. Experiments on nuScenes show that PolarBEVDet achieves superior performance. The code is available at https://github.com/Yzichen/PolarBEVDet.git. (This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.)
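
To make the polar BEV rasterization mentioned in the abstract more concrete, here is a minimal sketch of how an ego-frame point could be assigned to a radial/angular grid cell. It is not the paper's polar view transformer: the function names and the grid parameters (`r_max`, `num_r`, `num_theta`) are illustrative assumptions, and the code only covers the coordinate mapping.

```python
import math


def polar_bev_index(x, y, r_max=51.2, num_r=64, num_theta=256):
    """Map an ego-frame point (x, y) in metres to a (radial, angular) cell index.

    Returns None when the point lies outside r_max.
    All default grid sizes are assumptions made for illustration only.
    """
    r = math.hypot(x, y)
    if r >= r_max:
        return None
    theta = math.atan2(y, x) % (2.0 * math.pi)  # azimuth wrapped to [0, 2*pi)
    r_idx = int(r / (r_max / num_r))
    # The trailing modulo guards against theta rounding up to exactly 2*pi.
    theta_idx = int(theta / (2.0 * math.pi / num_theta)) % num_theta
    return r_idx, theta_idx


def polar_cell_area(r_idx, r_max=51.2, num_r=64, num_theta=256):
    """Ground area (m^2) covered by one polar cell in ring r_idx."""
    dr = r_max / num_r
    r_in, r_out = r_idx * dr, (r_idx + 1) * dr
    d_theta = 2.0 * math.pi / num_theta
    return 0.5 * d_theta * (r_out ** 2 - r_in ** 2)
```

Under these assumed settings, the innermost ring's cells cover far less ground than the outermost ring's, so the grid is finer near the ego vehicle and coarser far away. This is the property the abstract relies on when it says the polar grid adapts to the image information distribution.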

Paper Structure (32 sections, 15 equations, 7 figures, 7 tables)

This paper contains 32 sections, 15 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of BEV grid distribution and image information distribution. The Cartesian BEV space is uniformly rasterized horizontally and vertically, whereas the polar BEV space is rasterized angularly and radially. The points with different colors represent the frustum points of the multi-view cameras carrying the image information. Their distribution is consistent with the polar grid distribution: dense near the ego vehicle and sparse far from it.
  • Figure 2: Comparison of feature extraction and prediction based on different BEV representations. (a) Assuming that the multi-view cameras have the same imaging, the initial Cartesian BEV representation obtained by view transformation is view-symmetric. However, the subsequent translation-invariant convolution operation destroys this symmetry, resulting in different object features and predictions at different azimuths. (b) When the polar BEV representation is employed, the object features are approximately the same and parallel in the arrayed polar BEV features. In this way, the view symmetry can be preserved using the regular convolution operation, leading to similar object features and predictions at different azimuths.
  • Figure 3: Framework of PolarBEVDet. First, the multi-view image features extracted by the image-view encoder are fed to the polar view transformer to generate a polar BEV representation, which is subsequently arrayed to obtain an arrayed polar BEV representation. Then, the polar temporal fusion module fuses the cached historical polar BEV features to utilize the temporal information. Finally, after further feature extraction by the SAE module and the BEV encoder, the temporal BEV feature is sent to the polar detection head to predict the polar-parameterized representation of the object. In addition, during the training phase, the 2D auxiliary detection head is applied to improve the feature quality in perspective view.
  • Figure 4: Temporal fusion pipeline for polar BEV features. The historical feature is aligned according to ego-motion and then fused with the current feature by concatenation and $1 \times 1$ convolution. The details of the temporal alignment are illustrated in the orange dashed box. In addition, an example scenario is given on the left for ease of understanding.
  • Figure 5: Illustration of the Cartesian-parameterized and polar-parameterized prediction targets. Assume that multiple identical objects (represented by white cars) are distributed around the ego-vehicle, and that they are imaged identically in different views. (a) The Cartesian-parameterized prediction targets (for object orientation and velocity) are related to the azimuth of the object, which leads to the same imaging corresponding to different prediction targets. (b) In contrast, the polar-parameterized prediction targets are azimuth-equivalent, which reduces the optimization difficulty.
  • ...and 2 more figures
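
The azimuth-equivalence described in the Figure 5 caption can be checked numerically with a small sketch. The parameterization used here is an assumption, not necessarily the paper's exact formulation: orientation is taken relative to the object's azimuth, and velocity is split into radial and tangential components. The function names are hypothetical.

```python
import math


def to_polar_targets(x, y, yaw, vx, vy):
    """Convert Cartesian targets to assumed polar-parameterized targets.

    Returns (relative_yaw, radial_velocity, tangential_velocity), where the
    yaw is measured relative to the object's azimuth as seen from the ego.
    """
    phi = math.atan2(y, x)  # azimuth of the object centre
    rel_yaw = (yaw - phi + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
    v_r = vx * math.cos(phi) + vy * math.sin(phi)    # component along the ray
    v_t = -vx * math.sin(phi) + vy * math.cos(phi)   # component across the ray
    return rel_yaw, v_r, v_t


def rotate_about_ego(x, y, yaw, vx, vy, angle):
    """Rotate an object's position, heading and velocity around the ego origin."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, yaw + angle,
            c * vx - s * vy, s * vx + c * vy)


# The same object placed at two different azimuths around the ego vehicle.
obj = (10.0, 0.0, 0.3, 2.0, 0.5)
moved = rotate_about_ego(*obj, angle=2.0)

polar_a = to_polar_targets(*obj)
polar_b = to_polar_targets(*moved)
```

In this sketch the Cartesian targets of `obj` and `moved` differ (for example, their `yaw` and `vx` values), while `polar_a` and `polar_b` agree up to floating-point error. This is the sense in which identical imaging maps to identical prediction targets under a polar parameterization.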