HeightFormer: A Semantic Alignment Monocular 3D Object Detection Method from Roadside Perspective

Pei Liu, Zihao Zhang, Haipeng Liu, Nanfang Zheng, Meixin Zhu, Ziyuan Pu

TL;DR

HeightFormer is a roadside monocular 3D object detection framework that integrates a Spatial Former and a Voxel Pooling Former to improve 2D-to-3D projection based on height estimation. Experiments on Rope3D and DAIR-V2X-I indicate that the algorithm is robust and generalizes across detection scenarios.

Abstract

On-board 3D object detection has received extensive attention as a critical technology for autonomous driving, while few studies have focused on applying roadside sensors to 3D traffic object detection. Existing studies project 2D image features to 3D features through height estimation based on the frustum. However, they do not consider height alignment or the extraction efficiency of bird's-eye-view (BEV) features. We propose a novel 3D object detection framework that integrates a Spatial Former and a Voxel Pooling Former to enhance 2D-to-3D projection based on height estimation. Extensive experiments were conducted on the Rope3D and DAIR-V2X-I datasets, and the results show that the proposed algorithm achieves superior performance in detecting both vehicles and cyclists. These results indicate that the algorithm is robust and generalizes across detection scenarios. Improving the accuracy of roadside 3D object detection helps build a safe and trustworthy intelligent transportation system based on vehicle-road coordination and promotes the large-scale application of autonomous driving. The code and pre-trained models will be released at https://anonymous.4open.science/r/HeightFormer.
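
The geometric idea behind height-based 2D-to-3D projection can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera, a flat ground plane at z = 0 in a world frame with z pointing up, and a known per-pixel height. The function name `pixel_to_3d` and all of its parameters are hypothetical. A pixel defines a viewing ray, and the estimated height selects the point on that ray.

```python
def pixel_to_3d(u, v, height, fx, fy, cx, cy, R_c2w, cam_center):
    """Lift pixel (u, v) to a 3D world point lying at the given height.

    Assumptions (illustrative only): pinhole intrinsics (fx, fy, cx, cy),
    R_c2w is a 3x3 camera-to-world rotation given as nested lists,
    cam_center is the camera position in the world frame, and the world
    z axis points up with the ground at z = 0.
    """
    # Viewing ray direction in the camera frame (z axis points forward).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the ray direction into the world frame.
    d_w = tuple(sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    if abs(d_w[2]) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    # Solve cam_center.z + t * d_w.z = height for the ray parameter t.
    t = (height - cam_center[2]) / d_w[2]
    if t <= 0:
        raise ValueError("intersection lies behind the camera")
    return tuple(cam_center[i] + t * d_w[i] for i in range(3))


# Hypothetical setup: a camera mounted 6 m above the ground, looking straight down.
R_DOWN = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
point = pixel_to_3d(1460, 540, 1.0, 1000.0, 1000.0, 960.0, 540.0, R_DOWN, (0.0, 0.0, 6.0))
```

In this toy setup, the pixel 500 columns to the right of the principal point, with an estimated height of 1 m, lands 2.5 m from the point directly below the camera. A roadside camera is mounted at a fixed, known height, which is what makes this kind of height-based lifting applicable.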

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: 3D Detection Box Generation and BEV Perspective Overview Diagram. The detection box generation adopts the 7-parameter method. In the middle figure, L, W, H represent length, width, and height, respectively, C represents the coordinate (x, y, z) of the center point of the detection box, and $\theta$ represents the yaw angle.
  • Figure 2: Overview of Our Method Architecture. The input image is at the top left; the image backbone extracts 2D features from it. After the height features, context features, and camera parameters are fused, the projector lifts the fused features into 3D features. BEV features are then obtained by voxel pooling and the self-attention module. Finally, the 3D object detection head produces the detection results.
  • Figure 3: Deformable Multi-scale Spatial Cross-attention for Fusing Height Features and Context Features.
  • Figure 4: Diagram of Frustum Projection Using Height Estimation. The features of the bounding boxes detected in 2D are fused in the camera view and projected into 3D space through height estimation.
  • Figure 5: Schematic Diagram of Obtaining BEV Features by Self-Attention. The input on the left is the feature map after voxel pooling. The feature map is matched with fixed-size patches, and the BEV feature map is obtained through the multi-head attention mechanism and the MLP module.
  • ...and 1 more figures
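
The 7-parameter box described in the Figure 1 caption (center C = (x, y, z), size L, W, H, and yaw angle $\theta$) can be expanded into its eight corners. The sketch below is illustrative only and assumes a frame in which z points up and the yaw rotates about the z axis. The function name `box_corners` is hypothetical, and the axis convention used by a given dataset may differ.

```python
import math


def box_corners(x, y, z, length, width, height, yaw):
    """Return the 8 corners of a 7-parameter 3D box as (x, y, z) tuples.

    Assumptions (illustrative only): (x, y, z) is the box center, z points
    up, and yaw is a counter-clockwise rotation about the z axis in radians.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (length / 2, -length / 2):
        for dy in (width / 2, -width / 2):
            for dz in (height / 2, -height / 2):
                # Rotate the local offset about z, then translate to the center.
                corners.append((x + c * dx - s * dy, y + s * dx + c * dy, z + dz))
    return corners


# Hypothetical vehicle-sized box at the origin, first axis-aligned, then turned 90 degrees.
aligned = box_corners(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)
turned = box_corners(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, math.pi / 2)
```

With a yaw of 90 degrees the footprint swaps its extents: the 4 m length now runs along the y axis and the 2 m width along the x axis, while the vertical extent is unchanged.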