DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection

Zhe Huang, Yizhe Zhao, Hao Xiao, Chenyan Wu, Lingting Ge

TL;DR

DuoSpaceNet addresses camera-only 3D object detection by unifying BEV and PV representations in a single detection pipeline. It introduces a Duo Space Decoder with dual-space queries and space-specific cross-attention, augmented by a feature divergence enhancement and a temporal modeling module to handle multi-frame inputs. The approach achieves superior performance on nuScenes for both 3D detection and BEV map segmentation compared with state-of-the-art BEV-only and PV-only methods, with ablations validating each component. This integrated, multi-task framework offers robust 3D perception with potential for broader applications in autonomous driving and related vision tasks.

Abstract

Multi-view camera-only 3D object detection largely follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features, each with distinct advantages. Although several recent approaches explore combining BEV and PV, many rely on partial fusion or maintain separate detection heads. In this paper, we propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception. Our design includes a decoder that integrates BEV and PV features into unified detection queries, as well as a feature enhancement strategy that enriches the different feature representations. In addition, DuoSpaceNet can be extended to handle multi-frame inputs, enabling more robust temporal analysis. Extensive experiments on the nuScenes dataset show that DuoSpaceNet surpasses both BEV-based baselines (e.g., BEVFormer) and PV-based baselines (e.g., Sparse4D) in 3D object detection and BEV map segmentation, verifying the effectiveness of our proposed design.

Paper Structure (27 sections, 7 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Comparison of different image-only 3D perception frameworks. (a) Top: Bird's-eye-view-based (BEV-based) methods. (b) Middle: Perspective-view-based (PV-based) 3D detection-only methods. (c) Bottom: DuoSpaceNet (ours), where 3D detection benefits from both the 3D BEV and 2D PV feature spaces.
  • Figure 2: Comparison of bird's-eye-view (BEV) and perspective-view (PV) features. (a) Left: The original front-facing RGB image. (b) Middle: PV heatmap. (c) Right: BEV heatmap. Both heatmaps are produced by our final model. Different hues and brightness levels represent various response intensities, with black indicating minimal or no response. We highlight two car instances, each shown with its original image patch as well as the corresponding PV and BEV heatmaps. The BEV heatmap in (c) makes it easier to interpret 3D positions (for instance, the relative positions of leading vehicles) without overlap issues. In contrast, the PV heatmap in (b) preserves finer semantic details at higher resolution, which benefits attribute prediction.
  • Figure 3: Overall architecture of the proposed DuoSpaceNet. Multi-view 2D perspective-view (PV) features are extracted by the backbone and the feature pyramid network (FPN) \cite{lin2017feature}. Our 2D-to-3D BEV lifting strategy consists of a parameter-free voxel projection following \cite{harley2023simplebev} and a divergence enhancement process that makes the resulting BEV features more distinctive w.r.t. PV features. In our duo space framework, multi-view PV features and BEV features are treated as equally important and are fed into the decoder together. Each decoder layer has one self-attention layer and two deformable cross-attention layers \cite{zhu2020deformable}. The self-attention layer acts on both BEV and PV spaces, whereas each cross-attention layer attends only to either BEV features or PV features. This space-specific cross-attention helps preserve the uniqueness of the different feature spaces throughout the multi-layer refinement process. Details about the Duo Space Query Composition can be found in \ref{Eq:z_bev}, \ref{Eq:z_pv}, and \ref{Eq:qkv}. Dense map segmentation can be jointly carried out via a separate segmentation head.
  • Figure 4: Diagram of the proposed duo space temporal modeling with 4 frames. Temporal pose embeddings $Q^{(t)}_{Pose}$ are generated by warping pose vectors at the current timestamp through motion compensation. Subsequently, temporal duo space queries $\mathbf{z}^{(t)}_{BEV}$ and $\mathbf{z}^{(t)}_{PV}$ are assembled by broadcasting the current content embeddings over the time dimension and then combining them with the temporal pose embeddings. We then conduct space-specific cross-attention using recent BEV and PV feature maps, both of which are maintained by their respective memory queues. Note that temporal queries from each timestamp only interact with feature maps corresponding to that timestamp. The resulting temporal queries are aggregated via an MLP in a recurrent fashion.
  • Figure 5: Qualitative comparison of top-down 3D detection results among our method, PV-only, and BEV-only models. Ground truth bounding boxes are in green and predictions are in blue. Our prediction aligns most accurately with the ground truth. Please refer to the supplementary materials for more visualizations.
  • ...and 2 more figures
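
The decoder layer described in the Figure 3 caption (one self-attention layer shared across both spaces, followed by two space-specific cross-attention layers) can be illustrated with a minimal sketch. This is not the authors' implementation: plain softmax attention is swapped in for deformable cross-attention, the channel-wise concatenation used for the joint self-attention is an assumption, and pose embeddings, learned projections, and normalization are omitted. The names `attend`, `add`, `rand`, and `duo_space_layer` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, keys, values):
    """Plain scaled dot-product attention (stand-in for deformable attention)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise residual sum of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def duo_space_layer(z_bev, z_pv, bev_feats, pv_feats):
    """One simplified decoder layer over dual-space queries."""
    d = len(z_bev[0])
    # 1) Self-attention acts on both spaces at once. Assumption: the BEV and
    #    PV halves of each query are concatenated along the channel axis.
    joint = [b + p for b, p in zip(z_bev, z_pv)]
    joint = add(joint, attend(joint, joint, joint))
    z_bev = [row[:d] for row in joint]
    z_pv = [row[d:] for row in joint]
    # 2) Space-specific cross-attention: the BEV half only sees BEV features,
    #    the PV half only sees PV features.
    z_bev = add(z_bev, attend(z_bev, bev_feats, bev_feats))
    z_pv = add(z_pv, attend(z_pv, pv_feats, pv_feats))
    return z_bev, z_pv


def rand(n, d):
    """Random n x d matrix as nested lists (toy data only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
queries_bev, queries_pv = rand(3, 4), rand(3, 4)   # 3 queries, 4 channels each
out_bev, out_pv = duo_space_layer(queries_bev, queries_pv, rand(10, 4), rand(12, 4))
print(len(out_bev), len(out_bev[0]))
```

In this sketch, swapping the PV feature map for a different one leaves the BEV half of the output unchanged within a single layer. That is the sense in which the cross-attention is space-specific: information crosses between the two halves only through the shared self-attention step.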