HeightFormer: Explicit Height Modeling without Extra Data for Camera-only 3D Object Detection in Bird's Eye View

Yiming Wu, Ruixiang Li, Zequn Qin, Xinhai Zhao, Xi Li

TL;DR

The paper tackles camera-only 3D object detection in Bird's Eye View (BEV), where mapping 2D image features to 3D is ill-posed, by explicitly modeling heights in BEV space with a method called HeightFormer. It proves that height-based BEV construction is theoretically equivalent to depth-based image mapping, which allows training without LiDAR supervision and lets the method fit arbitrary camera rigs and types. The network uses a self-recursive height predictor to refine heights and a segmentation-based query mask to suppress background, and reports state-of-the-art performance among camera-only methods on nuScenes. HeightFormer is also described as robust to camera rig variations and as a potential plug-in for refining BEV features, which makes it relevant to LiDAR-free autonomous driving perception systems.

Abstract

Vision-based Bird's Eye View (BEV) representation is an emerging perception formulation for autonomous driving. The core challenge is to construct BEV space with multi-camera features, which is a one-to-many ill-posed problem. Diving into all previous BEV representation generation methods, we found that most of them fall into two types: modeling depths in image views or modeling heights in the BEV space, mostly in an implicit way. In this work, we propose to explicitly model heights in the BEV space, which needs no extra data like LiDAR and can fit arbitrary camera rigs and types compared to modeling depths. Theoretically, we give proof of the equivalence between height-based methods and depth-based methods. Considering the equivalence and some advantages of modeling heights, we propose HeightFormer, which models heights and uncertainties in a self-recursive way. Without any extra data, the proposed HeightFormer could estimate heights in BEV accurately. Benchmark results show that the performance of HeightFormer achieves SOTA compared with those camera-only methods.
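
To give some intuition for the claimed equivalence between depth-based and height-based modeling, the sketch below checks numerically that the two views carry the same information for one point. It is not the authors' proof. It assumes a single ideal pinhole camera, points expressed directly in the camera frame (x right, y along the height axis, z forward), and hypothetical intrinsics `K`; the function names are illustrative only.

```python
import numpy as np

# Hypothetical pinhole intrinsics (fx, fy, cx, cy); not taken from the paper.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])

def lift_with_depth(u, v, d, K):
    """Depth-based view: back-project pixel (u, v) at depth d to a 3D point."""
    x = (u - K[0, 2]) * d / K[0, 0]
    y = (v - K[1, 2]) * d / K[1, 1]
    return np.array([x, y, d])

def project_with_height(x, z, y, K):
    """Height-based view: project BEV cell (x, z) at height y to a pixel."""
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])

# Start from a pixel with a known depth ...
u, v, d = 900.0, 500.0, 20.0
x, y, z = lift_with_depth(u, v, d, K)

# ... then start from the resulting BEV cell (x, z) with its height y.
uv_back = project_with_height(x, z, y, K)
print(x, y, z, uv_back)
```

Under these assumptions, knowing the depth at a pixel determines the BEV cell and its height, and knowing the height at a BEV cell determines the pixel it maps to, so either quantity is enough to resolve the 2D-to-3D ambiguity for that point.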

Paper Structure (38 sections, 20 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Well-posed 2D to 3D mapping with the extra condition added. In solution 1, extra depth information of images is introduced. In solution 2, extra height information in BEV is introduced.
  • Figure 2: Two solutions to the ill-posed 2D-3D mapping problem, which are equivalent. In (a), per-pixel depth will be estimated in images, and the feature at a pixel will be projected into a BEV voxel at the depth. In (b), per-grid heights will be estimated or defined in the BEV space. Each anchor is associated with a pixel, and the image feature at that pixel will be accumulated into the grid.
  • Figure 3: (a) Heatmap of the weighted heights. (b) The ground truth bounding boxes in the bird's eye view. Heights are generated by weighting anchor heights with the attention weights of spatial cross-attention. The heatmap shows that height information is encoded in the attention weights of spatial cross-attention.
  • Figure 4: Reference points. 3D reference points are chosen in the range of $[ y_{xz} - h_{xz}/2, y_{xz} + h_{xz}/2 ]$ at equal intervals. They will be projected into multi-view images to generate 2D reference points, and features around those 2D reference points will be gathered into the corresponding BEV queries.
  • Figure 5: Demo of the ground truth heights $y_{xz}$. The blue cubes are the bounding boxes of objects. For visualization purposes, they are stretched along the height dimension.
  • ...and 6 more figures
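
The reference-point sampling described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a single pinhole camera with hypothetical intrinsics `K`, treats BEV coordinates as already expressed in the camera frame (so the ego-to-camera extrinsics are omitted), and uses an arbitrary count of four points per BEV grid cell.

```python
import numpy as np

def sample_reference_heights(y_center, h, num_points=4):
    """Equally spaced heights in [y_center - h/2, y_center + h/2]."""
    return np.linspace(y_center - h / 2.0, y_center + h / 2.0, num_points)

def project_to_image(points_3d, K):
    """Project (N, 3) camera-frame points (x, y, z) to (N, 2) pixels.

    Also returns an (N,) boolean mask of points in front of the camera.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    z = points_3d[:, 2]
    valid = z > 1e-6
    uvw = points_3d @ K.T
    # Divide by depth; invalid points are divided by 1 and masked out later.
    uv = uvw[:, :2] / np.where(valid, z, 1.0)[:, None]
    return uv, valid

# Hypothetical intrinsics; not taken from the paper.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])

# One BEV grid cell at (x, z) with predicted center height y_xz and extent h_xz.
x, z = 2.0, 20.0
y_xz, h_xz = 1.0, 2.0

heights = sample_reference_heights(y_xz, h_xz, num_points=4)
ref_points_3d = np.stack(
    [np.full_like(heights, x), heights, np.full_like(heights, z)], axis=1
)
ref_points_2d, valid = project_to_image(ref_points_3d, K)
print(ref_points_2d[valid])
```

In a multi-camera setup, the same 3D reference points would be projected into every view, and only the projections that land inside an image would contribute features to the corresponding BEV query.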