FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li

TL;DR

FASTopoWM is a novel fast-slow lane segment topology reasoning framework augmented with latent world models; it substantially improves temporal perception within the slow pipeline.

Abstract

Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fail to leverage temporal information effectively to enhance detection and reasoning performance. Recently, stream-based temporal propagation methods have demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, they remain limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models, conditioned on the action latent, to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).
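
The data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: every function name here (`action_latent`, `query_world_model`, `shared_decoder`, `forward`) is hypothetical, the "world model" and "decoder" are fixed arithmetic stand-ins for learned networks, and the fallback to the fast pipeline when no pose is available is an assumption based on the stated goal of reducing the impact of pose estimation failures.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def action_latent(rel_pose: Tuple[float, float, float]) -> Vector:
    """Hypothetical encoding of relative ego motion (dx, dy, dyaw) as a latent."""
    dx, dy, dyaw = rel_pose
    return [dx, dy, math.sin(dyaw), math.cos(dyaw)]


def query_world_model(prev_queries: List[Vector], action: Vector) -> List[Vector]:
    """Toy stand-in for a learned query world model.

    Propagates historical queries to the current timestep, conditioned on
    the action latent. Here the 'model' is just a uniform shift.
    """
    shift = sum(action) / len(action)
    return [[v + shift for v in q] for q in prev_queries]


def shared_decoder(queries: List[Vector], bev: Vector) -> List[Vector]:
    """Toy stand-in for the decoder shared by the fast and slow pipelines."""
    bev_mean = sum(bev) / len(bev)
    return [[0.5 * v + 0.5 * bev_mean for v in q] for q in queries]


def forward(
    prev_queries: List[Vector],
    init_queries: List[Vector],
    bev: Vector,
    rel_pose: Optional[Tuple[float, float, float]],
) -> Dict[str, Optional[List[Vector]]]:
    # Fast pipeline: newly initialized queries only, no temporal input.
    fast_out = shared_decoder(init_queries, bev)

    # Assumption: without a pose (or without history) only the fast
    # pipeline can produce an output.
    if rel_pose is None or not prev_queries:
        return {"fast": fast_out, "slow": None}

    # Slow pipeline: stream queries predicted by the world model, decoded
    # with the same decoder so both outputs can be supervised in parallel.
    stream_queries = query_world_model(prev_queries, action_latent(rel_pose))
    slow_out = shared_decoder(stream_queries, bev)
    return {"fast": fast_out, "slow": slow_out}
```

Calling `forward(prev, init, bev, None)` returns only a fast result, while passing a relative pose yields both, which mirrors the idea that the two pipelines share weights and are trained side by side. The BEV world model is omitted here for brevity.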

Paper Structure

This paper contains 27 sections, 20 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Pipeline Comparison. Existing stream-based methods suffer significant performance degradation when pose estimation is unavailable. Our approach addresses this issue by incorporating fast-slow pipelines and two latent world models.
  • Figure 2: Overall Framework. (a) Encoder. Historical queries and BEV features are processed by world models, conditioned on the action latent, to predict the stream queries and stream BEV features. Multi-view images are encoded into BEV features, which are then fused with the stream BEV features. (b) Decoder. The slow and fast systems share the same Transformer layers and prediction heads to enable parallel supervision of both stream and newly initialized queries. T represents the frame at timestep T.
  • Figure 3: Comparison with state-of-the-art methods on OpenLane-V2 subsetB on centerline perception. All models adopt ResNet-50 as the backbone network and are trained for 24 epochs. TopoFormer$^\star$ adopts a staged training strategy that utilizes a pretrained lane detector for topology reasoning training. While this leads to better detection performance, it offers only a slight advantage in topology prediction.
  • Figure 4: Qualitative results of the baseline and our FASTopoWM. The baseline (BL) is LaneSegNet with stream-based temporal propagation. For better viewing, zoom in on the image.
  • Figure 5: Ablation studies on different modules. The baseline is LaneSegNet with stream-based temporal propagation. FS denotes the fast-slow system. QWM and BWM represent the query world model and BEV world model, respectively. Tem. and Sin. indicate temporal and single-frame detection.
  • ...and 1 more figures