HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via Transformer

Zhang Zhang, Chao Sun, Chao Yue, Da Wen, Yujie Chen, Tianze Wang, Jianghao Leng

TL;DR

HeightFormer addresses roadside vision-centric 3D object detection by predicting height distributions directly in voxel features using a transformer that operates on local height sequences. The method maps image features to voxel space, applies height-aware self-attention to refine height cues, and decodes BEV features for detection, balancing accuracy and efficiency. It achieves state-of-the-art results on DAIR-V2X-I and Rope3D, outperforming image-based and BEV-based height methods as well as other voxel operators, with notable gains in long-range and dense-scene scenarios. By preserving explicit height information in 3D space and restricting attention to local height sequences, HeightFormer offers robust, scalable roadside perception suitable for real-world deployment.

Abstract

Roadside vision-centric 3D object detection has received increasing attention in recent years. It expands the perception range of autonomous vehicles and enhances road safety. Previous methods focused on predicting per-pixel height rather than depth and made significant gains in roadside visual perception. However, these methods are limited by the perspective property of image features, in which near objects appear large and far objects appear small, making it difficult for the network to understand the real dimensions of objects in the 3D world. Compared with image features, BEV features and voxel features present the real distribution of objects in the 3D world. However, BEV features tend to lose details due to the lack of explicit height information, and voxel features are computationally expensive. Motivated by these observations, HeightFormer, an efficient framework that learns height prediction in voxel features via a transformer, is proposed. It groups the voxel features into local height sequences and uses an attention mechanism to predict the height distribution. Subsequently, the local height sequences are reassembled to generate accurate 3D features. The proposed method is evaluated on two large-scale roadside benchmarks, DAIR-V2X-I and Rope3D. Extensive experiments show that HeightFormer outperforms state-of-the-art methods on the roadside vision-centric 3D object detection task.
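
The grouping, attention, and reassembly steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that each local height sequence is the vertical column of voxels above one BEV cell, uses a single attention head with a residual connection, and takes hypothetical projection matrices `w_q`, `w_k`, `w_v` as inputs. The function name `height_self_attention` is likewise illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def height_self_attention(voxels, w_q, w_k, w_v):
    """Single-head self-attention along the height axis only.

    voxels: array of shape (C, Z, Y, X).
    w_q, w_k, w_v: hypothetical (C, C) projection matrices.
    Returns the refined voxels (C, Z, Y, X) and the attention
    weights over height for every column, shape (Y, X, Z, Z).
    """
    c, z, y, x = voxels.shape
    # Group: one sequence of Z tokens per (y, x) column -> (Y*X, Z, C).
    seqs = voxels.transpose(2, 3, 1, 0).reshape(y * x, z, c)
    q, k, v = seqs @ w_q, seqs @ w_k, seqs @ w_v
    # Attention is restricted to tokens within the same column.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)  # (Y*X, Z, Z)
    attn = softmax(scores, axis=-1)
    out = seqs + attn @ v  # residual connection (an assumption)
    # Reassemble the sequences into a voxel grid.
    refined = out.reshape(y, x, z, c).transpose(3, 2, 0, 1)
    return refined, attn.reshape(y, x, z, z)
```

Because each column attends only to itself, the attention cost per column grows with Z squared rather than with the square of the total voxel count, which is one plausible reading of how restricting attention to local height sequences keeps voxel processing affordable.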

Paper Structure

This paper contains 12 sections, 12 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The orange arrow represents the height prediction network. (a) The method based on image features is limited by the perspective properties of the image, making it difficult to understand the 3D space. (b) The method based on BEV features is limited by the lack of explicit height information, making it difficult to accurately predict the height distribution. (c) The method based on voxel features is rich in spatial contextual information and explicit height information of objects. As shown in (d) and (e), the proposed HeightFormer outperforms state-of-the-art methods on the roadside visual 3D object detection task.
  • Figure 2: (a) The proposed HeightFormer consists of five main stages. The Image Encoder is composed of ResNet34 and FPN and outputs image features that combine multiple receptive fields through the feature pyramid network. The View Transform projects the image features into voxel features using mapping tables, which are computed from the intrinsic and extrinsic parameters of the camera. Height Attention applies transformer layers to the local height sequences and outputs refined voxel features. In the BEV Decoder, voxel features are compressed along the height dimension to generate BEV features. The Detection Head predicts the 3D bounding boxes from the BEV features. (b) The pipeline of the transformer block.
  • Figure 3: Green boxes represent the ground truth and blue boxes represent the predicted 3D bounding boxes. In Scene A, HeightFormer produces fewer false positives and more accurate localization than BEVSpread and BEVHeight when facing dense pedestrians, and it also predicts vehicle yaw angles more accurately. In Scene B, HeightFormer predicts overall dimensions and yaw angles more accurately for large objects (e.g., a bus), demonstrating its ability to understand the 3D world. In Scene C, HeightFormer produces fewer false negatives than BEVSpread and BEVHeight when detecting small objects at long distances.
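
The View Transform and BEV Decoder stages in the Figure 2 caption can be sketched as follows. This is an illustrative reading under stated assumptions, not the paper's code: it assumes a pinhole camera with intrinsic matrix `K` and world-to-camera extrinsics `R`, `t`, nearest-pixel sampling, voxel centers listed in (Z, Y, X) order, and height compression done by folding the height axis into the channel axis. All function names are hypothetical.

```python
import numpy as np

def build_mapping_table(centers, K, R, t, feat_hw):
    """Precompute which feature-map pixel each voxel center projects to.

    centers: (N, 3) voxel centers in world coordinates.
    K: (3, 3) intrinsics; R: (3, 3) and t: (3,) world-to-camera extrinsics.
    feat_hw: (H, W) size of the image feature map.
    Returns an (N, 2) integer table of (row, col) and an (N,) validity mask.
    """
    cam = centers @ R.T + t  # world -> camera coordinates
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
    uv = (cam @ K.T)[:, :2] / safe_depth[:, None]  # perspective division
    cols = np.floor(uv[:, 0]).astype(int)
    rows = np.floor(uv[:, 1]).astype(int)
    h, w = feat_hw
    valid = in_front & (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    return np.stack([rows, cols], axis=1), valid

def lift_to_voxels(img_feat, table, valid, grid_zyx):
    """Gather image features (C, H, W) into a voxel grid (C, Z, Y, X)."""
    c = img_feat.shape[0]
    vox = np.zeros((c, table.shape[0]), dtype=img_feat.dtype)
    # Voxels that project outside the image keep zero features.
    vox[:, valid] = img_feat[:, table[valid, 0], table[valid, 1]]
    return vox.reshape(c, *grid_zyx)

def compress_height(voxels):
    """Fold the height axis into channels: (C, Z, Y, X) -> (C*Z, Y, X)."""
    c, z, y, x = voxels.shape
    return voxels.reshape(c * z, y, x)
```

For a fixed roadside camera the calibration does not change between frames, so a table like this could be computed once and reused, leaving only a gather operation at inference time.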