DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

Peidong Li, Wancheng Shen, Qihao Huang, Dixiao Cui

TL;DR

DualBEV addresses BEV perception by unifying dual view transformations (3D‑to‑2D and 2D‑to‑3D) through a probabilistic framework that estimates correspondences via BEV, projection, and image probabilities. It introduces HeightTrans for CNN‑based 3D‑to‑2D VT and Prob‑LSS to strengthen LSS‑style 2D‑to‑3D VT, fused in one stage by the Dual Feature Fusion module to produce robust BEV features with BEV probability guidance. The approach achieves state‑of‑the‑art performance on nuScenes without Transformer, reporting 55.2% mAP and 63.4% NDS on the test set, while maintaining near real‑time efficiency through precomputation. Extensive ablations validate the contributions of probabilistic measurements, Prob‑Sampling, multi‑height sampling, Prob‑LSS, and the DFF fusion design, with qualitative visualizations confirming improved detection across ranges. Limitations include reliance on single‑frame depth signals and the absence of a temporal module, suggesting future work to integrate temporal context and extend to BEV segmentation or 3D occupancy tasks.

Abstract

Camera-based Bird's-Eye-View (BEV) perception often struggles between adopting 3D-to-2D or 2D-to-3D view transformation (VT). The 3D-to-2D VT typically employs resource-intensive Transformer to establish robust correspondences between 3D and 2D features, while the 2D-to-3D VT utilizes the Lift-Splat-Shoot (LSS) pipeline for real-time application, potentially missing distant information. To address these limitations, we propose DualBEV, a unified framework that utilizes a shared feature transformation incorporating three probabilistic measurements for both strategies. By considering dual-view correspondences in one stage, DualBEV effectively bridges the gap between these strategies, harnessing their individual strengths. Our method achieves state-of-the-art performance without Transformer, delivering comparable efficiency to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set. Code is available at https://github.com/PeidongLi/DualBEV.

Paper Structure (29 sections, 11 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Unified Feature Transformation: Our approach considers correspondences between BEV and image space utilizing image probability, projection probability and BEV probability. In the 3D-to-2D strategy, HeightTrans (HT) projects pre-defined 3D points to sample features, while in the 2D-to-3D strategy, LSS lifts image features to 3D space, both through image probability and projection probability from different directions. Finally, BEV probability is applied to enhance the representation of features.
  • Figure 2: Comparison of Fusion Strategies. $\oplus$ denotes summation, $\otimes$ denotes multiplication, and $\circledcirc$ denotes channel-attention-based fusion.
  • Figure 3: Overview of DualBEV: Initially, we employ SceneNet to predict the depth $D$ (projection probability) and instance mask $M$ (image probability) of the input images. Subsequently, the Prob-LSS stream follows BEVPoolv2 to generate the LSS feature. Concurrently, the HeightTrans stream uses Prob-Sampling to project pre-defined 3D points onto the 2D space and retrieve the corresponding image features. Throughout this process, all features are accompanied by probabilities derived from the depth map and instance mask. Finally, we fuse the two streams and predict the BEV probability $P$ with the DFF module, resulting in the final BEV feature $F$.
  • Figure 4: BEV Feature Visualization with GT boxes. Prob-LSS pays more attention to the close range, while HeightTrans can also capture distant information. In the red rectangle (distant range), our unified framework uses HeightTrans to compensate for the weak detection of barriers by Prob-LSS. In the orange rectangle (close range), BEV features are enhanced by both streams. The ego vehicle is located at the center of the BEV features.
  • Figure 5: Dual Feature Fusion Module: Dual features are first concatenated and then passed into the CAF module for fusion. Subsequently, the SAE-ProbNet is utilized to obtain the BEV probability for the final BEV feature.
  • ...and 1 more figures
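
The captions for Figures 1 and 3 describe how a pre-defined 3D point picks up an image feature together with its probabilities. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes the three probabilities combine by multiplication, that points are already in the camera frame (no extrinsics), and that sampling uses the nearest pixel and nearest depth bin. All function and parameter names (`project_point`, `prob_sample`, `bev_cell_feature`, `focal`, `cx`, `cy`) are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Feature = List[float]


def project_point(xyz: Sequence[float], focal: float, cx: float, cy: float
                  ) -> Optional[Tuple[float, float, float]]:
    """Pinhole projection of a camera-frame 3D point to (u, v, depth)."""
    x, y, z = xyz
    if z <= 0:  # point lies behind the camera
        return None
    return focal * x / z + cx, focal * y / z + cy, z


def prob_sample(xyz, feat, mask, depth_prob, depth_bins, focal, cx, cy) -> Feature:
    """Sample one image feature, weighted by image and projection probability.

    feat[row][col]          -> feature vector at a pixel
    mask[row][col]          -> image probability M (assumed in [0, 1])
    depth_prob[d][row][col] -> projection probability D for depth bin d
    """
    channels = len(feat[0][0])
    zero = [0.0] * channels
    proj = project_point(xyz, focal, cx, cy)
    if proj is None:
        return zero
    u, v, z = proj
    col, row = int(round(u)), int(round(v))  # nearest-pixel lookup (assumption)
    if not (0 <= row < len(mask) and 0 <= col < len(mask[0])):
        return zero
    # Nearest depth bin to the projected depth (assumption).
    d = min(range(len(depth_bins)), key=lambda i: abs(depth_bins[i] - z))
    weight = mask[row][col] * depth_prob[d][row][col]
    return [weight * c for c in feat[row][col]]


def bev_cell_feature(points, bev_prob, feat, mask, depth_prob, depth_bins,
                     focal, cx, cy) -> Feature:
    """Sum the samples of several heights in one BEV cell, then apply P."""
    acc = [0.0] * len(feat[0][0])
    for xyz in points:
        sample = prob_sample(xyz, feat, mask, depth_prob, depth_bins, focal, cx, cy)
        acc = [a + s for a, s in zip(acc, sample)]
    return [bev_prob * a for a in acc]
```

In this sketch a point that projects outside the image, or behind the camera, contributes a zero vector, so a BEV cell with no valid correspondence stays empty regardless of its BEV probability.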