DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection
Zhe Huang, Yizhe Zhao, Hao Xiao, Chenyan Wu, Lingting Ge
TL;DR
DuoSpaceNet addresses camera-only 3D object detection by unifying BEV and PV representations in a single detection pipeline. It introduces a Duo Space Decoder with dual-space queries and space-specific cross-attention, complemented by a feature divergence enhancement and a temporal modeling module for multi-frame inputs. On nuScenes, it outperforms state-of-the-art BEV-only and PV-only methods on both 3D detection and BEV map segmentation, and ablations validate each component. The resulting multi-task framework may also be useful for other autonomous driving and vision tasks.
Abstract
Multi-view camera-only 3D object detection largely follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features, each with distinct advantages. Although several recent approaches explore combining BEV and PV, many rely on partial fusion or maintain separate detection heads. In this paper, we propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception. Our design includes a decoder that integrates BEV and PV features into unified detection queries, as well as a feature enhancement strategy that enriches the different feature representations. In addition, DuoSpaceNet can be extended to handle multi-frame inputs, enabling more robust temporal analysis. Extensive experiments on the nuScenes dataset show that DuoSpaceNet surpasses both BEV-based baselines (e.g., BEVFormer) and PV-based baselines (e.g., Sparse4D) in 3D object detection and BEV map segmentation, verifying the effectiveness of the proposed design.
