HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via Transformer
Zhang Zhang, Chao Sun, Chao Yue, Da Wen, Yujie Chen, Tianze Wang, Jianghao Leng
TL;DR
HeightFormer addresses roadside vision-centric 3D object detection by predicting height distributions directly in voxel features using a transformer that operates on local height sequences. The method maps image features to voxel space, applies height-aware self-attention to refine height cues, and decodes BEV features for detection, balancing accuracy and efficiency. It achieves state-of-the-art results on DAIR-V2X-I and Rope3D, outperforming image-based and BEV-based height methods as well as other voxel operators, with notable gains in long-range and dense-scene scenarios. By preserving explicit height information in 3D space and restricting attention to local height sequences, HeightFormer offers robust, scalable roadside perception suitable for real-world deployment.
Abstract
Roadside vision-centric 3D object detection has received increasing attention in recent years. It expands the perception range of autonomous vehicles and enhances road safety. Previous methods focused on predicting per-pixel height rather than depth, making significant gains in roadside visual perception. However, these methods are limited by the perspective property of image features, in which near objects appear large and far objects appear small, making it difficult for the network to understand the real dimensions of objects in the 3D world. Compared with image features, bird's-eye-view (BEV) features and voxel features present the real distribution of objects in the 3D world. However, BEV features tend to lose details due to the lack of explicit height information, and voxel features are computationally expensive. Motivated by these observations, an efficient framework that learns height prediction in voxel features via a transformer is proposed, dubbed HeightFormer. It groups the voxel features into local height sequences and utilizes an attention mechanism to obtain the height distribution prediction. Subsequently, the local height sequences are reassembled to generate accurate 3D features. The proposed method is applied to two large-scale roadside benchmarks, DAIR-V2X-I and Rope3D. Extensive experiments show that HeightFormer outperforms state-of-the-art methods on the roadside vision-centric 3D object detection task.
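The grouping and reassembly steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The tensor layout `(X, Y, Z, C)`, the single attention head, the random projection weights `w_q`, `w_k`, `w_v`, `w_h`, and the function name `height_attention` are all assumptions made for illustration. Each BEV cell contributes one height sequence of length `Z`, attention runs only within that sequence, and the resulting per-bin height distribution reweights the voxel features before they are reassembled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def height_attention(voxels, w_q, w_k, w_v, w_h):
    """Hypothetical height-aware attention over local height sequences.

    voxels: (X, Y, Z, C) voxel features (assumed layout).
    w_q, w_k, w_v: (C, C) projection matrices (assumed, untrained here).
    w_h: (C,) projection to one height logit per height bin (assumed).
    """
    X, Y, Z, C = voxels.shape

    # Group voxels into local height sequences: one length-Z sequence per BEV cell.
    seqs = voxels.reshape(X * Y, Z, C)

    # Single-head self-attention restricted to each height sequence.
    q, k, v = seqs @ w_q, seqs @ w_k, seqs @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)   # (X*Y, Z, Z)
    attended = softmax(scores, axis=-1) @ v          # (X*Y, Z, C)

    # Height distribution prediction: a probability over the Z height bins.
    height_prob = softmax(attended @ w_h, axis=-1)   # (X*Y, Z)

    # Reweight the features by the height distribution and reassemble the grid.
    voxels_out = (seqs * height_prob[..., None]).reshape(X, Y, Z, C)

    # Collapse the height axis to obtain BEV features for a detection head.
    bev = voxels_out.sum(axis=2)                     # (X, Y, C)
    return voxels_out, height_prob.reshape(X, Y, Z), bev


# Toy usage with random features and weights (illustrative sizes only).
rng = np.random.default_rng(0)
X, Y, Z, C = 4, 5, 8, 16
voxels = rng.standard_normal((X, Y, Z, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
w_h = rng.standard_normal(C) * 0.1

voxels_out, height_prob, bev = height_attention(voxels, w_q, w_k, w_v, w_h)
print(voxels_out.shape, height_prob.shape, bev.shape)
```

Because attention is restricted to each height sequence, the score tensor in this sketch has `X * Y * Z * Z` entries, whereas attention over all voxels jointly would need `(X * Y * Z) ** 2`. This is one plausible reading of how local height sequences keep voxel-space attention affordable.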
