
CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity

Hao Shi, Chengshan Pang, Jiaming Zhang, Kailun Yang, Yuhao Wu, Huajian Ni, Yining Lin, Rainer Stiefelhagen, Kaiwei Wang

TL;DR

Complementary-BEV (CoBEV) not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters.

Abstract

Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems, which extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. While previous studies have limitations in using only depth or height information, we find both depth and height matter and they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature is primarily focused on distinguishing between various categories of height intervals, essentially providing semantic context. This insight motivates the development of Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distribution and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also seamlessly integrated to further enhance the detection accuracy from the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset, demonstrating that CoBEV not only achieves the accuracy of the new state-of-the-art, but also significantly advances the robustness of previous methods in challenging long-distance scenarios and noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in terms of easy mode. The source code will be made publicly available at https://github.com/MasterHow/CoBEV.
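
The lifting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bin counts, and the feature values are hypothetical, and a plain softmax stands in for whatever the depth and height branches actually predict. It only shows how a per-pixel distribution over depth (or height) bins is combined with a context feature by an outer product.

```python
import math

def softmax(logits):
    """Turn raw per-bin scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(bin_logits, context):
    """Outer product of a bin distribution (D,) and a context feature (C,).

    Returns a D x C list: one weighted copy of the context feature per bin.
    """
    probs = softmax(bin_logits)
    return [[p * c for c in context] for p in probs]

# Hypothetical values for a single pixel.
context = [0.5, -1.0, 2.0]            # C = 3 context channels
depth_logits = [0.1, 2.0, 0.3, -1.0]  # D = 4 depth bins
height_logits = [1.5, 0.2, -0.5]      # 3 height bins

depth_frustum = lift_pixel(depth_logits, context)    # 4 x 3
height_frustum = lift_pixel(height_logits, context)  # 3 x 3
```

Because each distribution sums to one, summing the lifted features over the bins recovers the original context feature. The two branches therefore spread the same semantics along different geometric axes.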

Paper Structure

This paper contains 21 sections, 26 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: Overview of the Complementary-BEV (CoBEV) architecture. First, the roadside monocular image is fed into the feature extractor to encode high-dimensional features. The image features are then sent to the Camera-aware Hybrid Lifting (CHL) module, which consists of a depth branch, a context branch, and a height branch, and are fused with the camera parameters encoded by an MLP. The per-pixel distributions from the depth and height branches are combined with the context feature via an outer product to obtain a frustum-shaped point cloud, whose coordinates can be obtained from depth or height geometry (see Fig. 2 for details). This point cloud is then splatted into depth-based and height-based compressed 3D features by partial-pillar voxel pooling. Finally, the multi-source BEV features are fused via the Complementary Feature Selection (CFS) module to produce robust BEV features for 3D object detection.
  • Figure 2: Camera-based Hybrid Lifting includes (a) explicit lifting based on the depth distribution and camera parameters, and (b) explicit lifting with similar triangles based on the height distribution and camera parameters.
  • Figure 3: The proposed Complementary Feature Selection (CFS) module. Unlike previous work [yang2023bevheight], we maintain a low-scale vertical axis in the depth and height BEV features after voxel pooling to promote information flow during feature fusion. CFS consists of two cascaded feature selection stages: the first selects complementary features in the column-shaped channels, and the second selects features in the BEV plane. The result is ultimately compressed to two-dimensional complementary BEV features through strided 3D convolutional compression.
  • Figure 4: Illustration of the BEV Feature Distillation. It employs a fusion-to-camera paradigm, aligning the student CoBEV detector with the LiDAR-camera fusion version of the CoBEV teacher across three stages: low-level BEV features, high-level BEV features, and the response. Unlike previous work [zhou2023unidistill], we apply supervision signal adaptation only at the low-level features, so that the high-level features emphasize valuable street-view structural knowledge of all categories and the performance improvement is agnostic to object size.
  • Figure 5: Qualitative comparisons on the DAIR-V2X-I dataset [yu2022dair]. BEVDepth [li2022bevdepth] and BEVHeight [yang2023bevheight] miss detections on distant targets. In contrast, CoBEV handles challenging long-distance targets effectively, surpassing the performance of prior methods.
  • ...and 7 more figures
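
The two explicit lifting schemes named in the Figure 2 caption can be sketched under simplifying assumptions. This is an illustrative sketch, not the paper's formulation: it assumes an ideal pinhole camera with zero roll, pitched downward by a known angle and mounted at a known height above a flat ground plane. All function names and numeric values are hypothetical.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a camera-frame ray with unit z
    (x right, y down, z forward)."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)

def to_ground_frame(p, pitch):
    """Rotate a camera-frame vector into a ground-aligned frame (y straight
    down), assuming a camera pitched down by `pitch` radians, zero roll."""
    x, y, z = p
    c, s = math.cos(pitch), math.sin(pitch)
    return (x, y * c + z * s, -y * s + z * c)

def lift_by_depth(u, v, depth, intr, pitch):
    """Depth-based lifting: scale the ray so its camera-frame z equals depth."""
    ray = pixel_ray(u, v, *intr)
    return to_ground_frame(tuple(depth * r for r in ray), pitch)

def lift_by_height(u, v, height, intr, pitch, cam_height):
    """Height-based lifting: scale the ray so it ends `height` metres above
    the ground. By similar triangles the scale factor is
    (cam_height - height) / (vertical component of the ray)."""
    gx, gy, gz = to_ground_frame(pixel_ray(u, v, *intr), pitch)
    if gy <= 0:
        raise ValueError("ray does not point toward the ground")
    s = (cam_height - height) / gy
    return (s * gx, s * gy, s * gz)

intr = (1000.0, 1000.0, 960.0, 540.0)  # hypothetical fx, fy, cx, cy
pitch = math.radians(10.0)             # hypothetical downward tilt
cam_height = 6.0                       # hypothetical mounting height (m)

# Lift one pixel by depth, read off the height of the resulting point,
# then lift the same pixel by that height: both give the same 3D point.
p_depth = lift_by_depth(1200, 700, 15.0, intr, pitch)
implied_height = cam_height - p_depth[1]
p_height = lift_by_height(1200, 700, implied_height, intr, pitch, cam_height)
```

In this idealized setting the two parameterizations describe the same ray, so a consistent depth and height pair lands on the same point. They differ in which quantity the network has to estimate per pixel.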