DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation

Yueru Luo, Shuguang Cui, Zhen Li

TL;DR

DV-3DLane tackles robust 3D lane detection by fusing camera and LiDAR information in dual perspective-view (PV) and bird's-eye-view (BEV) representations. It introduces Bidirectional Feature Fusion, a Unified Query Generator, and 3D Dual-view Deformable Attention to jointly learn and integrate lane-aware features across both views. On OpenLane, it achieves state-of-the-art performance, with an 11.2-point improvement in F1 score at a 0.5 m distance threshold and substantial reductions in localization errors, validating the value of preserving and cross-fusing dual-view representations. The framework offers a practical, end-to-end multi-modal pipeline that improves lane perception under varying lighting, weather, and road geometries.

Abstract

Accurate 3D lane estimation is crucial for ensuring safety in autonomous driving. However, prevailing monocular techniques suffer from depth loss and lighting variations, hampering accurate 3D lane detection. In contrast, LiDAR points offer geometric cues and enable precise localization. In this paper, we present DV-3DLane, a novel end-to-end Dual-View multi-modal 3D Lane detection framework that synergizes the strengths of both images and LiDAR points. We propose to learn multi-modal features in dual-view spaces, i.e., perspective view (PV) and bird's-eye-view (BEV), effectively leveraging the modal-specific information. To achieve this, we introduce three designs: 1) A bidirectional feature fusion strategy that integrates multi-modal features into each view space, exploiting their unique strengths. 2) A unified query generation approach that leverages lane-aware knowledge from both PV and BEV spaces to generate queries. 3) A 3D dual-view deformable attention mechanism, which aggregates discriminative features from both PV and BEV spaces into queries for accurate 3D lane detection. Extensive experiments on the public benchmark, OpenLane, demonstrate the efficacy and efficiency of DV-3DLane. It achieves state-of-the-art performance, with a remarkable 11.2 gain in F1 score and a substantial 53.5% reduction in errors. The code is available at \url{https://github.com/JMoonr/dv-3dlane}.
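
To make the three designs concrete, here is a minimal PyTorch-style sketch of the overall forward pass. It is an illustration under assumed shapes, not the authors' implementation: the backbones are stand-in convolutions, BFF is approximated by resample-and-concatenate fusion, the unified query generator is reduced to plain embeddings, and the 3D dual-view deformable attention is approximated with ordinary per-view cross-attention. All module names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in for the real backbones (e.g. a ResNet for PV, a voxelized
    point-cloud network for BEV); two strided convs keep the sketch runnable."""
    def __init__(self, in_ch, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1))
    def forward(self, x):
        return self.net(x)

class DV3DLaneSketch(nn.Module):
    """Rough shape of the dual-view pipeline; not the official code."""
    def __init__(self, dim=64, num_queries=12, num_points=10):
        super().__init__()
        self.img_backbone = TinyBackbone(3, dim)   # PV features from the camera
        self.bev_backbone = TinyBackbone(1, dim)   # BEV features from LiDAR grid
        # Design 1 (BFF), approximated by resample-and-concatenate fusion
        self.pv_fuse = nn.Conv2d(2 * dim, dim, 1)
        self.bev_fuse = nn.Conv2d(2 * dim, dim, 1)
        # Design 2: a unified lane-aware query set (the paper distills it from
        # IAM-generated PV and BEV queries; plain embeddings stand in here)
        self.queries = nn.Embedding(num_queries, dim)
        self.point_embed = nn.Embedding(num_points, dim)
        # Design 3, approximated with ordinary cross-attention per view
        self.attn_pv = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.attn_bev = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Linear(dim, 3)              # (x, y, z) per lane point

    def forward(self, image, bev_grid):
        f_pv = self.img_backbone(image)            # (B, C, Hp, Wp)
        f_bev = self.bev_backbone(bev_grid)        # (B, C, Hb, Wb)
        # Bidirectional fusion: each view receives the other view's features
        to_pv = F.adaptive_avg_pool2d(f_bev, f_pv.shape[-2:])
        to_bev = F.adaptive_avg_pool2d(f_pv, f_bev.shape[-2:])
        f_pv = self.pv_fuse(torch.cat([f_pv, to_pv], dim=1))
        f_bev = self.bev_fuse(torch.cat([f_bev, to_bev], dim=1))
        # Unified queries + broadcast-added per-point embeddings
        q = self.queries.weight[:, None, :] + self.point_embed.weight[None]
        q = q.flatten(0, 1)[None].expand(image.shape[0], -1, -1)  # (B, Q*P, C)
        # Aggregate discriminative features from BOTH views into the queries
        kv_pv = f_pv.flatten(2).transpose(1, 2)
        kv_bev = f_bev.flatten(2).transpose(1, 2)
        q = q + self.attn_pv(q, kv_pv, kv_pv)[0]
        q = q + self.attn_bev(q, kv_bev, kv_bev)[0]
        return self.head(q)                        # per-point 3D coordinates

# Toy usage:
# model = DV3DLaneSketch()
# out = model(torch.rand(1, 3, 96, 128), torch.rand(1, 1, 64, 64))  # (1, 120, 3)
```

The point to notice is the symmetry: PV and BEV feature maps are kept separate end to end, and every lane query reads from both, rather than collapsing one view into the other early on.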

Paper Structure

This paper contains 28 sections, 6 equations, 8 figures, 8 tables, 2 algorithms.

Figures (8)

  • Figure 1: FPS vs. F1 score. All models are tested on a single V100 GPU, and the F1 score is evaluated with a harsh distance threshold of 0.5 m on the OpenLane-1K dataset. Our model sets a new state-of-the-art, and our tiny version surpasses all previous methods with the fastest FPS. More details can be found in Tab.~\ref{tab:main_results} and the Appendix.
  • Figure 1: More Results. Rows (a), (b), (c) show projections of 3D lanes from the ground truth (GT), DV-3DLane, and LATR~\cite{luo2023latr}, with differences highlighted by colored arrows. Row (d) compares GT (red) and our prediction (green) in 3D. Best viewed in color; zoom in for details.
  • Figure 2: Overview of DV-3DLane. First, images and point clouds undergo separate processing by the image backbone and point backbone. In the middle stages of the backbones, we introduce Bidirectional Feature Fusion (BFF) to fuse multi-modal features across views. Subsequently, the instance activation map (IAM) is utilized to produce lane-aware queries $\mathbf{Q}_{pv}$ and $\mathbf{Q}_{bev}$. These queries are then subjected to Dual-view Query Clustering, which aggregates the dual-view query sets $\mathbf{Q}_{pv}$ and $\mathbf{Q}_{bev}$ into a unified query set $\mathbf{C}$, further augmented with learnable point embeddings $\mathbf{E}_{points}$ to form the query $\mathbf{Q}$. Additionally, we introduce 3D Dual-view Deformable Attention to consistently aggregate point features from both view features $\mathbf{F}_{pv}$ and $\mathbf{F}_{bev}$ into $\mathbf{Q}$ (see the sampling sketch after this list). $\oplus$ denotes broadcast summation. Notably, the $\oplus \, \mathbf{E}_{points}$ operation is performed only in the first layer, while in the following layers, $\oplus \, \mathbf{Q}$ is utilized. Different colored boxes denote queries targeting different lanes; dashed boxes represent the background, and box texture indicates features.
  • Figure 3: Bidirectional Feature Fusion (BFF)
  • Figure 4: Illustration of one-to-one matching and lane-centric clustering. (a) and (b) show the assignment for BEV and PV predictions, respectively. (c) depicts the pairing used for clustering: queries targeting the same lane are treated as a positive pair, while all other pairs are negative (see the pairing sketch after this list).
  • ...and 3 more figures
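
The geometric step behind the 3D dual-view deformable attention in Figure 2 is that a single 3D reference point is sampled consistently in both views: projected through the camera for PV, and dropped onto the ground plane for BEV. Below is a hedged sketch of that sampling alone; the function name, calibration conventions, and the assumption that the intrinsics are pre-scaled to the feature-map resolution are illustrative, and the full mechanism (learned offsets and attention weights) is omitted.

```python
import torch
import torch.nn.functional as F

def sample_dual_view(f_pv, f_bev, ref_3d, K, T_cam, bev_range):
    """Illustrative sketch: look up one 3D point consistently in PV and BEV.

    f_pv:      (B, C, Hp, Wp) perspective-view feature map
    f_bev:     (B, C, Hb, Wb) bird's-eye-view feature map
    ref_3d:    (B, N, 3) reference points (x, y, z) in ego coordinates
    K:         (B, 3, 3) intrinsics, assumed pre-scaled to f_pv's resolution
    T_cam:     (B, 4, 4) ego-to-camera extrinsics
    bev_range: (x_min, x_max, y_min, y_max) metric extent of the BEV grid
    """
    # --- PV branch: project the 3D points through the camera ---
    ones = torch.ones_like(ref_3d[..., :1])
    pts_cam = (T_cam @ torch.cat([ref_3d, ones], -1).transpose(1, 2))[:, :3]
    uvw = K @ pts_cam                               # (B, 3, N) homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-5)    # perspective division
    Hp, Wp = f_pv.shape[-2:]
    grid_pv = torch.stack([uv[:, 0] / (Wp - 1),     # normalize to [-1, 1]
                           uv[:, 1] / (Hp - 1)], dim=-1) * 2 - 1
    feat_pv = F.grid_sample(f_pv, grid_pv[:, None], align_corners=True)
    # --- BEV branch: drop z and normalize (x, y) into the BEV grid ---
    x_min, x_max, y_min, y_max = bev_range
    gx = (ref_3d[..., 0] - x_min) / (x_max - x_min) * 2 - 1
    gy = (ref_3d[..., 1] - y_min) / (y_max - y_min) * 2 - 1
    feat_bev = F.grid_sample(f_bev, torch.stack([gx, gy], -1)[:, None],
                             align_corners=True)
    # Both are (B, C, 1, N): the same 3D point, seen from two views
    return feat_pv[:, :, 0], feat_bev[:, :, 0]
```

A decoder layer would add learned offsets to ref_3d, sample several such points, and mix the PV and BEV features with predicted attention weights before updating the queries.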
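
The pairing rule in Figure 4(c) also maps naturally onto a contrastive objective: a PV query and a BEV query matched (one-to-one) to the same ground-truth lane form a positive pair, and everything else is negative. The snippet below is an assumed InfoNCE-style rendering of that rule, not the paper's exact loss; lane_id_pv and lane_id_bev are hypothetical per-query GT lane indices produced by the matching, with -1 marking background.

```python
import torch
import torch.nn.functional as F

def lane_contrastive_loss(q_pv, q_bev, lane_id_pv, lane_id_bev, tau=0.07):
    """Sketch of lane-centric pairing: PV/BEV queries assigned to the same
    GT lane are positives; all other cross-view pairs are negatives."""
    q_pv = F.normalize(q_pv, dim=-1)               # (Np, C) PV query features
    q_bev = F.normalize(q_bev, dim=-1)             # (Nb, C) BEV query features
    sim = q_pv @ q_bev.t() / tau                   # (Np, Nb) cosine similarities
    pos = (lane_id_pv[:, None] == lane_id_bev[None, :]) \
        & (lane_id_pv[:, None] >= 0)               # positive-pair mask
    keep = pos.any(dim=1)                          # PV queries with a positive
    if not keep.any():                             # e.g. all-background frame
        return sim.new_zeros(())
    log_p = sim.log_softmax(dim=1)
    # average log-likelihood of the positives for each anchored PV query
    loss = -(log_p * pos)[keep].sum(1) / pos[keep].sum(1)
    return loss.mean()

# Toy usage:
# loss = lane_contrastive_loss(torch.randn(6, 32), torch.randn(6, 32),
#                              torch.tensor([0, 1, 2, -1, 0, 1]),
#                              torch.tensor([1, 0, 2, 2, -1, 0]))
```

Pulling queries that target the same lane together (and pushing others apart) is what lets the two view-specific query sets collapse into the single unified set $\mathbf{C}$ shown in Figure 2.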