CurveFormer++: 3D Lane Detection by Curve Propagation with Temporal Curve Queries and Attention

Yifeng Bai, Zhirong Chen, Pengpeng Liang, Bo Song, Erkang Cheng

TL;DR

This work tackles monocular 3D lane detection by proposing CurveFormer++, a single-stage Transformer-based method that directly regresses 3D lane parameters from front-view image features without a BEV transformation. Lanes are represented as curves via dynamic anchor point sets, refined by a curve cross-attention mechanism and a context sampling module, with an optional temporal fusion variant, CurveFormer++-T, that propagates information from earlier frames through sparse historical curve queries and their anchor points. Key contributions include dynamic anchor point sets, a curve cross-attention module, an anchor-range restriction in sampling, and a lane-centric temporal propagation framework, validated by results on ONCE-3DLanes and OpenLane with comprehensive ablations. The approach reduces reliance on BEV mappings, improves temporal stability, and holds promise for real-time autonomous driving perception, with future potential in multi-camera and multi-modal extensions.

Abstract

In autonomous driving, accurate 3D lane detection using monocular cameras is important for downstream tasks. Recent CNN and Transformer approaches usually apply a two-stage model design. The first stage transforms the image features from a front-view image into a bird's-eye-view (BEV) representation. Subsequently, a sub-network processes the BEV features to generate the 3D detection results. However, these approaches rely heavily on a challenging module that transforms image features from the perspective view to a BEV representation. In our work, we present CurveFormer++, a single-stage Transformer-based method that does not require the view transformation module and directly infers 3D lane results from the perspective image features. Specifically, our approach models the 3D lane detection task as a curve propagation problem, where each lane is represented by a curve query with a dynamic and ordered anchor point set. By employing a Transformer decoder, the model can iteratively refine the 3D lane results. A curve cross-attention module is introduced to calculate similarities between image features and curve queries. To handle varying lane lengths, we employ context sampling and anchor point restriction techniques to compute more relevant image features. Furthermore, we apply a temporal fusion module that incorporates selected informative sparse curve queries and their corresponding anchor point sets to leverage historical information. In the experiments, we evaluate our approach on two publicly available real-world datasets. The results demonstrate that our method provides outstanding performance compared with both CNN-based and Transformer-based methods. We also conduct ablation studies to analyze the impact of each component.
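
The idea of representing a lane as an ordered anchor point set that is refined layer by layer can be illustrated with a small sketch. This is not the authors' implementation: the `CurveQuery` container, the `refine` loop, and the `toy_offsets_toward` function are hypothetical names introduced here, the learned decoder layer is replaced by a hand-written stand-in that moves each point halfway toward a known target curve, and keeping the longitudinal coordinate y fixed during refinement is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]   # (x lateral, y longitudinal, z height)
Offset = Tuple[float, float]           # (dx, dz) correction for one anchor point


@dataclass
class CurveQuery:
    """Hypothetical container: one lane as an ordered set of 3D anchor points."""
    anchors: List[Point3D]  # ordered by increasing y


def refine(query: CurveQuery,
           predict_offsets: Callable[[List[Point3D]], List[Offset]],
           num_layers: int) -> CurveQuery:
    """Apply num_layers refinement steps; each step adds per-point offsets.

    Assumption: y stays fixed and only x and z are updated.
    """
    anchors = list(query.anchors)
    for _ in range(num_layers):
        offsets = predict_offsets(anchors)
        anchors = [(x + dx, y, z + dz)
                   for (x, y, z), (dx, dz) in zip(anchors, offsets)]
    return CurveQuery(anchors)


def toy_offsets_toward(target: List[Point3D]) -> Callable[[List[Point3D]], List[Offset]]:
    """Stand-in for a learned decoder layer: step halfway toward a known target."""
    def predict(anchors: List[Point3D]) -> List[Offset]:
        return [(0.5 * (tx - x), 0.5 * (tz - z))
                for (x, _, z), (tx, _, tz) in zip(anchors, target)]
    return predict


# Toy target lane: x = 0.01 * y^2 on flat ground (z = 0).
ys = [5.0, 10.0, 20.0, 40.0]
target = [(0.01 * y * y, y, 0.0) for y in ys]

# Start from a straight line and refine it over six stand-in "decoder layers".
initial = CurveQuery([(0.0, y, 0.0) for y in ys])
refined = refine(initial, toy_offsets_toward(target), num_layers=6)

for (x, y, _), (tx, _, _) in zip(refined.anchors, target):
    print(f"y={y:5.1f}  x={x:7.3f}  target x={tx:7.3f}")
```

In this toy setting the residual to the target halves at every step, which mirrors the coarse-to-fine behaviour that iterative refinement is meant to provide; in the actual method the offsets would come from attention over image features rather than from a known target.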

Paper Structure

This paper contains 19 sections, 12 equations, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: Comparison of different 3D lane detection pipelines. (a) 2D image prediction and post-processing; (b) 3D lane detection with camera extrinsic prediction; (c) Transformer-based dense BEV map construction and 3D lane prediction; (d) our proposed CurveFormer++, which directly predicts 3D lane parameters from sparse curve queries using a curve cross-attention mechanism in the Transformer decoder.
  • Figure 2: Comparisons of different Transformer-based temporal information fusion approaches for 3D lane detection.
  • Figure 3: Overview of our proposed CurveFormer++ for single-frame 3D lane detection (a) and the temporal propagation fusion block in CurveFormer++-T (b).
  • Figure 4: Illustration of the curve query representation with a dynamic anchor point set in the X-O-Y and Z-O-Y planes (a) and visualization of the iterative refinement in the image view (b). Each dynamic anchor point set initially follows a standard normal distribution.
  • Figure 5: Illustration of the Context Sampling Module. Our context sampling module learns sampling offsets by leveraging both queries and image features.
  • ...and 2 more figures