CurveFormer++: 3D Lane Detection by Curve Propagation with Temporal Curve Queries and Attention
Yifeng Bai, Zhirong Chen, Pengpeng Liang, Bo Song, Erkang Cheng
TL;DR
This work tackles monocular 3D lane detection with CurveFormer++, a single-stage Transformer-based method that regresses 3D lane parameters directly from front-view image features, without a BEV view transformation. Each lane is represented as a curve query with a dynamic anchor point set, which is refined by a curve cross-attention mechanism and a context sampling module. An optional temporal fusion variant, CurveFormer++-T, propagates information across frames through sparse historical curve queries and their anchor points. Key contributions include the dynamic anchor point sets, the curve cross-attention module, an anchor-range restriction in sampling, and a lane-centric temporal propagation framework, validated by strong results on ONCE-3DLanes and OpenLane and by comprehensive ablations. The approach reduces reliance on BEV mappings, improves temporal stability, and shows promise for real-time autonomous driving perception, with possible future extensions to multi-camera and multi-modal settings.
Abstract
In autonomous driving, accurate 3D lane detection using monocular cameras is important for downstream tasks. Recent CNN and Transformer approaches usually apply a two-stage model design. The first stage transforms the image features of a front-view image into a bird's-eye-view (BEV) representation. A sub-network then processes the BEV features to generate the 3D detection results. However, these approaches rely heavily on a challenging module that transforms image features from the perspective view to a BEV representation. In our work, we present CurveFormer++, a single-stage Transformer-based method that does not require the view transform module and directly infers 3D lane results from the perspective image features. Specifically, our approach models the 3D lane detection task as a curve propagation problem, where each lane is represented by a curve query with a dynamic and ordered anchor point set. By employing a Transformer decoder, the model iteratively refines the 3D lane results. A curve cross-attention module is introduced to calculate similarities between image features and curve queries. To handle varying lane lengths, we employ context sampling and anchor point restriction techniques to compute more relevant image features. Furthermore, we apply a temporal fusion module that incorporates selected informative sparse curve queries and their corresponding anchor point sets to leverage historical information. In the experiments, we evaluate our approach on two publicly available real-world datasets. The results demonstrate that our method provides outstanding performance compared with both CNN-based and Transformer-based methods. We also conduct ablation studies to analyze the impact of each component.
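To make the idea of an ordered anchor point set concrete, the following minimal sketch samples 3D anchor points along one lane curve and projects them into the front-view image, where image features could then be sampled. It is an illustration under stated assumptions, not the implementation used in CurveFormer++: the polynomial lane parameterization, the zero-pitch pinhole camera, and the names `anchor_points` and `project_to_image` are all hypothetical.

```python
import numpy as np

def anchor_points(coeffs_x, coeffs_z, y_start, y_end, n_points=10):
    """Sample an ordered set of 3D anchor points along one lane curve.

    Assumption: the lane is modeled as polynomials x(y) and z(y) in the
    longitudinal distance y, and [y_start, y_end] is the lane's own range,
    so lanes of different lengths get different anchor sets.
    """
    y = np.linspace(y_start, y_end, n_points)   # ordered from near to far
    x = np.polyval(coeffs_x, y)                 # lateral offset
    z = np.polyval(coeffs_z, y)                 # height above ground
    return np.stack([x, y, z], axis=1)          # shape (n_points, 3)

def project_to_image(points, K, cam_height):
    """Project ground-frame points (x right, y forward, z up) to pixels.

    Assumption: a forward-looking pinhole camera with zero pitch and roll,
    mounted cam_height metres above the ground origin.
    """
    # Convert to the camera frame: X right, Y down, Z forward.
    cam = np.stack(
        [points[:, 0], cam_height - points[:, 2], points[:, 1]], axis=1
    )
    uvw = cam @ K.T                             # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]             # (u, v) per anchor point

# Hypothetical intrinsics for a 1920x1080 image.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])

# A flat, straight lane on the camera axis, visible from 5 m to 50 m.
pts = anchor_points([0.0, 0.0, 0.0, 0.0], [0.0], 5.0, 50.0)
uv = project_to_image(pts, K, cam_height=1.5)
```

In this sketch, the projected locations `uv` play the role of reference points around which image features would be gathered. Restricting the sampling to the range `[y_start, y_end]` is one simple way to keep the sampled features relevant to a lane of a given length.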
