HeightLane: BEV Heightmap guided 3D Lane Detection

Chaesong Park, Eunbin Seo, Jongwoo Lim

TL;DR

HeightLane tackles monocular 3D lane detection by addressing depth ambiguity and ground slope variation through a dense BEV ground heightmap. It predicts a heightmap $\mathcal{H} \in \mathbb{R}^{H' \times W'}$ on a predefined BEV grid $\mathbf{B} \in \mathbb{R}^{H' \times W'}$ using multi-slope anchors $\Theta$, and uses this height information to guide a deformable attention-based spatial transform from front-view to BEV space. The heightmap is learned jointly with a dense sampling mechanism and supervised with LiDAR-derived ground truth from Waymo, which enables ground-aware sampling and positional encoding in BEV. On OpenLane, HeightLane achieves a state-of-the-art F-score, particularly in curved and junction-rich scenarios, improving on methods that assume a flat ground or a 2-DoF ground plane.

Abstract

Accurate 3D lane detection from monocular images presents significant challenges due to depth ambiguity and imperfect ground modeling. Previous attempts to model the ground have often used a planar ground assumption with limited degrees of freedom, making them unsuitable for complex road environments with varying slopes. Our study introduces HeightLane, an innovative method that predicts a height map from monocular images by creating anchors based on a multi-slope assumption. This approach provides a detailed and accurate representation of the ground. HeightLane employs the predicted heightmap along with a deformable attention-based spatial feature transform framework to efficiently convert 2D image features into 3D bird's eye view (BEV) features, enhancing spatial understanding and lane structure recognition. Additionally, the heightmap is used for the positional encoding of BEV features, further improving their spatial accuracy. This explicit view transformation bridges the gap between front-view perceptions and spatially accurate BEV representations, significantly improving detection performance. To address the lack of the necessary ground truth (GT) height map in the original OpenLane dataset, we leverage the Waymo dataset and accumulate its LiDAR data to generate a height map for the drivable area of each scene. The GT heightmaps are used to train the heightmap extraction module from monocular images. Extensive experiments on the OpenLane validation set show that HeightLane achieves state-of-the-art performance in terms of F-score, highlighting its potential in real-world applications.
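The multi-slope assumption mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each anchor is a plane whose height grows linearly with forward distance at a fixed slope angle, and the names `slope_anchor` and `multi_slope_anchors`, the grid size, the cell size, and the slope values are all hypothetical.

```python
import math


def slope_anchor(num_rows, num_cols, cell_size_m, slope_deg):
    """Heightmap (num_rows x num_cols) of a plane rising at slope_deg.

    Assumption: row 0 sits at the ego vehicle and each further row is
    cell_size_m metres farther ahead, so height = distance * tan(slope).
    """
    tan_slope = math.tan(math.radians(slope_deg))
    anchor = []
    for row in range(num_rows):
        forward_m = row * cell_size_m  # forward distance of this BEV row
        anchor.append([forward_m * tan_slope] * num_cols)
    return anchor


def multi_slope_anchors(num_rows, num_cols, cell_size_m, slopes_deg):
    """One anchor heightmap per candidate slope angle (hypothetical helper)."""
    return [slope_anchor(num_rows, num_cols, cell_size_m, s) for s in slopes_deg]


# Three illustrative anchors: downhill, flat, and uphill.
anchors = multi_slope_anchors(
    num_rows=4, num_cols=3, cell_size_m=10.0, slopes_deg=[-5.0, 0.0, 5.0]
)
```

Under this assumption, the flat anchor corresponds to the planar-ground case, and the other anchors give the network candidate heights at which to sample front-view features before it predicts the final heightmap.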

Paper Structure (19 sections, 13 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: (a) Assuming the ground is a flat plane, 2D images or features can be transformed into BEV features using inverse perspective mapping (IPM), as in PersFormer. (b) Modeling the ground as a plane with 2 degrees of freedom (2-DoF), such as pitch and height, provides more generality and is used by LATR for positional encoding in the transformer. (c) Our method predicts a dense height map to spatially transform 2D image features onto a predefined BEV feature grid. Bold indicates how each method represents the ground.
  • Figure 2: Overall architecture of HeightLane. HeightLane takes a 2D image as input and extracts multi-scale front-view features through a CNN backbone. Using predefined multi-slope heightmap anchors, the extrinsic matrix T, and the intrinsic matrix K, the 2D front-view features are sampled onto a BEV grid to obtain a BEV height feature, which is then processed through a CNN layer to predict the heightmap. The predicted heightmap is used in the spatial feature transformation, where the initial BEV feature query and the heightmap determine the reference pixels that the query should attend to in the front-view features. The front-view features serve as keys and values, while the BEV features act as queries. Through deformable attention, this process produces enhanced BEV feature queries.
  • Figure 3: LiDAR accumulation results for the Up & Down scenario in the OpenLane (PersFormer) validation set. The color bar on the left represents color values corresponding to the road height.
  • Figure 4: Structure of the Height-Guided Spatial Transform Framework using deformable attention (PersFormer; Zhu et al., 2020). Flattened BEV queries receive height positional encoding during self-attention, and in cross-attention, the heightmap maps BEV queries to image pixels. Deformable attention then learns offsets to generate multiple reference points.
  • Figure 5: Qualitative evaluation on the OpenLane validation set, compared with the existing best-performing model, LATR. First row: input image. Second row: 3D lane detection results, with ground truth (red), HeightLane (green), and LATR (blue). Third row: ground truth and HeightLane in the Y-Z plane. Fourth row: ground truth and LATR in the Y-Z plane. Zoom in to see details.
  • ...and 1 more figures
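
The captions of Figures 2 and 4 describe mapping BEV cells to image pixels using the heightmap, the extrinsic matrix T, and the intrinsic matrix K. The sketch below shows one way such a reference pixel could be computed with a standard pinhole projection. It is an illustration under stated assumptions, not the paper's code: the ego frame is taken as x forward, y left, z up; T is assumed to be a 4x4 ego-to-camera transform; and `project_bev_cell`, the focal length, the principal point, and the 1.5 m camera height are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def project_bev_cell(x_fwd, y_left, height, K, T):
    """Project one BEV cell, lifted to its ground height, to a pixel (u, v).

    Returns None when the point lies behind the camera.
    """
    p_cam = mat_vec(T, [x_fwd, y_left, height, 1.0])[:3]
    if p_cam[2] <= 0.0:  # non-positive depth: not visible
        return None
    u, v, w = mat_vec(K, p_cam)
    return (u / w, v / w)


# Hypothetical intrinsics: focal length 1000 px, principal point (960, 640).
K = [[1000.0, 0.0, 960.0],
     [0.0, 1000.0, 640.0],
     [0.0, 0.0, 1.0]]

# Hypothetical extrinsics: camera 1.5 m above the ego origin, looking forward.
# Camera axes are x right, y down, z forward.
T = [[0.0, -1.0, 0.0, 0.0],
     [0.0, 0.0, -1.0, 1.5],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

flat_pixel = project_bev_cell(20.0, 0.0, 0.0, K, T)    # flat-ground assumption
raised_pixel = project_bev_cell(20.0, 0.0, 1.0, K, T)  # ground 1 m higher
```

In this toy setup the same BEV cell lands on image row 715 under a flat-ground assumption and on row 665 when the ground is 1 m higher, which shows why a per-cell height changes where front-view features are sampled. In the architecture described above, deformable attention would then learn offsets around such a reference pixel.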