DualAD: Disentangling the Dynamic and Static World for End-to-End Driving

Simon Doll, Niklas Hanselmann, Lukas Schneider, Richard Schulz, Marius Cordts, Markus Enzweiler, Hendrik P. A. Lensch

TL;DR

DualAD tackles the challenge of end-to-end driving by disentangling dynamic agents and static scene content into two dedicated streams within a transformer architecture. Dynamic objects are modeled with object-centric queries that attend to image features and are motion-compensated over time, while static scene elements are represented with a BEV grid that propagates with ego motion; a dynamic-static cross-attention block enables cross-stream information exchange. The approach yields state-of-the-art results on nuScenes perception tasks and provides tangible gains in downstream motion prediction and planning when integrated with recent end-to-end driving frameworks, demonstrating the importance of disentangled representations for temporal consistency in dynamic driving scenes. The work also offers extensive ablations and runtime analyses, confirming the value of the dual-stream design and cross-attention, and points to promising future directions with additional modalities and tasks to further boost real-time autonomous driving systems.

Abstract

State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline that can be trained in an end-to-end fashion by passing latent representations between the different modules. In contrast to previous approaches that rely on a unified grid to represent the belief state of the scene, we propose dedicated representations to disentangle dynamic agents and static scene elements. This allows us to explicitly compensate for the effect of both ego and object motion between consecutive time steps and to flexibly propagate the belief state through time. Furthermore, dynamic objects can not only attend to the input camera images, but also directly benefit from the inferred static scene structure via a novel dynamic-static cross-attention. Extensive experiments on the challenging nuScenes benchmark demonstrate the benefits of the proposed dual-stream design, especially for modelling highly dynamic agents in the scene, and highlight the improved temporal consistency of our approach. Our method, DualAD, not only outperforms independently trained single-task networks, but also improves over previous state-of-the-art end-to-end models by a large margin on all tasks along the functional chain of driving.
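
The compensation for ego and object motion mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 2D bird's-eye-view setting, a constant-velocity object model, ego poses given as `(x, y, yaw)` in a global frame, and object velocities measured relative to the ground but expressed in the axes of the previous ego frame. All function and parameter names are hypothetical.

```python
import math


def to_global(point, ego_pose):
    """Map a 2D point from an ego frame into the global frame.

    ego_pose is (x, y, yaw) of that ego frame in global coordinates (assumed layout).
    """
    x, y = point
    ex, ey, yaw = ego_pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (ex + c * x - s * y, ey + s * x + c * y)


def to_ego(point, ego_pose):
    """Map a 2D point from the global frame into the given ego frame (inverse of to_global)."""
    gx, gy = point
    ex, ey, yaw = ego_pose
    c, s = math.cos(yaw), math.sin(yaw)
    dx, dy = gx - ex, gy - ey
    return (c * dx + s * dy, -s * dx + c * dy)


def propagate_reference_point(point_prev, velocity_prev, dt, ego_prev, ego_curr):
    """Carry an object's reference point from the previous to the current time step.

    Step 1 (object motion): advance the point with a constant-velocity model.
    Step 2 (ego motion): re-express the advanced point in the current ego frame.
    """
    moved = (
        point_prev[0] + velocity_prev[0] * dt,
        point_prev[1] + velocity_prev[1] * dt,
    )
    return to_ego(to_global(moved, ego_prev), ego_curr)


# Ego drives 5 m forward in 0.5 s; a lead vehicle drives at the same speed.
lead = propagate_reference_point((10.0, 0.0), (10.0, 0.0), 0.5, (0.0, 0.0, 0.0), (5.0, 0.0, 0.0))
# A parked car at the same starting position only experiences the ego motion.
parked = propagate_reference_point((10.0, 0.0), (0.0, 0.0), 0.5, (0.0, 0.0, 0.0), (5.0, 0.0, 0.0))
```

In this toy setup the lead vehicle stays 10 m ahead of the ego vehicle, while the parked car moves from 10 m to 5 m ahead. A unified grid that is only warped by ego motion would treat both cases like the parked car, which is the limitation the object-centric stream is designed to avoid.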

Paper Structure

This paper contains 16 sections, 4 figures, and 14 tables.

Figures (4)

  • Figure 1: Comparison of Representation Design of unified grid-based approaches and our dual-stream design. By explicitly disentangling dynamic and static representations, the dynamic stream can aggregate highly descriptive features. This is achieved through direct attention to image features, as well as explicit compensation for object and ego motion, which is not feasible with unified grids.
  • Figure 2: DualAD Architecture: Two separate representations are chosen for dynamic agents ($\mathcal{Q}_{\text{obj}}$) and static elements ($\mathcal{Q}_{\text{BEV}}$), as shown in Fig. \ref{fig:architecture}. Self- and cross-attention are performed simultaneously in the proposed dual-stream transformer, as shown in Fig. \ref{fig:transformer_layer}, paired with the novel dynamic-static cross-attention block that allows the dynamic agents to benefit from the inferred scene structure.
  • Figure 3: Qualitative Results. Fig. \ref{fig:qualitative_uniad} shows the output of DualAD for object tracking, map segmentation, motion prediction and planning. Fig. \ref{fig:qualitative_vad} shows the same scene for DualVAD, the vectorized version of our approach.
  • Figure 4: Performance Comparison of DualAD and UniAD \cite{hu2023planning} for two different scenes. Predictions are shown in orange, ground-truth annotations in blue, and the ego location is marked with a red cross. While highly dynamic agents cause perception errors such as track losses or distorted objects for UniAD, DualAD consistently captures them due to the proposed dual-stream design.
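
The dynamic-static cross-attention named in the Figure 2 caption lets each object query read from the static BEV representation. The sketch below illustrates the general idea with plain single-head scaled dot-product attention and a residual connection. It is an assumption-laden toy, not the paper's block: it omits learned projections, multiple heads and normalization, attends globally over all BEV cells, and uses hypothetical names throughout.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_static_cross_attention(obj_queries, bev_features):
    """Update each object query with context gathered from the BEV cells.

    obj_queries:  list of N query vectors (each of length d), the dynamic stream.
    bev_features: list of M flattened BEV cell vectors (each of length d), the static stream.
    Returns N updated vectors: query + attention-weighted sum of BEV features.
    """
    d = len(obj_queries[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q in obj_queries:
        # Similarity between this object query and every BEV cell.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, cell)) for cell in bev_features]
        weights = softmax(scores)
        # Weighted average of BEV features, i.e. the static context for this object.
        context = [sum(w * cell[j] for w, cell in zip(weights, bev_features)) for j in range(d)]
        # Residual connection keeps the original object information.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


# Two BEV cells and two object queries with feature dimension 2.
bev = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0], [10.0, 0.0]]
out = dynamic_static_cross_attention(queries, bev)
```

The first query is equally similar to both cells and therefore receives their average, while the second query aligns with the first cell and draws almost all of its context from it. Information flows only from the static stream to the dynamic stream in this sketch, matching the caption's description of dynamic agents benefiting from the inferred scene structure.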