DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
Simon Doll, Niklas Hanselmann, Lukas Schneider, Richard Schulz, Marius Cordts, Markus Enzweiler, Hendrik P. A. Lensch
TL;DR
DualAD approaches end-to-end driving by disentangling dynamic agents and static scene content into two dedicated streams within a transformer architecture. Dynamic objects are modeled with object-centric queries that attend to image features and are motion-compensated over time. Static scene elements are represented with a bird's-eye-view (BEV) grid that is propagated with ego motion. A dynamic-static cross-attention block exchanges information between the two streams. The approach yields state-of-the-art results on nuScenes perception tasks and improves downstream motion prediction and planning when integrated with recent end-to-end driving frameworks, which underlines the importance of disentangled representations for temporal consistency in dynamic driving scenes. Extensive ablations and runtime analyses support the dual-stream design and the cross-attention, and the work points to additional modalities and tasks as future directions for real-time autonomous driving systems.
Abstract
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline that can be trained in an end-to-end fashion by passing latent representations between the different modules. In contrast to previous approaches that rely on a unified grid to represent the belief state of the scene, we propose dedicated representations to disentangle dynamic agents and static scene elements. This allows us to explicitly compensate for the effect of both ego and object motion between consecutive time steps and to flexibly propagate the belief state through time. Furthermore, dynamic objects can not only attend to the input camera images, but also directly benefit from the inferred static scene structure via a novel dynamic-static cross-attention. Extensive experiments on the challenging nuScenes benchmark demonstrate the benefits of the proposed dual-stream design, especially for modelling highly dynamic agents in the scene, and highlight the improved temporal consistency of our approach. Our method, DualAD, not only outperforms independently trained single-task networks, but also improves over previous state-of-the-art end-to-end models by a large margin on all tasks along the functional chain of driving.
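To make the idea of compensating for both ego and object motion concrete, the following is a minimal geometric sketch, not the authors' implementation. It assumes 2D poses, a constant-velocity model for each object, and explicit positions rather than latent queries; the function names `se2` and `compensate_motion` are hypothetical and chosen only for illustration.

```python
import numpy as np


def se2(x, y, yaw):
    """Build a 3x3 homogeneous ego-to-world transform for a 2D pose."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s, c, y],
                     [0.0, 0.0, 1.0]])


def compensate_motion(pos_prev, vel_prev, dt, ego_prev, ego_curr):
    """Propagate object positions from the previous ego frame to the current one.

    pos_prev: (N, 2) object positions in the previous ego frame, in metres.
    vel_prev: (N, 2) object velocities in the previous ego frame, in m/s.
    dt:       time between the two steps, in seconds.
    ego_prev, ego_curr: 3x3 ego-to-world transforms at the two time steps.
    """
    # Object motion (assumption: constant velocity over dt).
    moved = pos_prev + vel_prev * dt

    # Ego motion: previous ego frame -> world -> current ego frame.
    prev_to_curr = np.linalg.inv(ego_curr) @ ego_prev

    # Apply the transform to all points in homogeneous coordinates.
    homog = np.hstack([moved, np.ones((moved.shape[0], 1))])
    return (homog @ prev_to_curr.T)[:, :2]


if __name__ == "__main__":
    ego_prev = se2(0.0, 0.0, 0.0)
    ego_curr = se2(5.0, 0.0, 0.0)  # ego drove 5 m forward
    positions = np.array([[10.0, 0.0], [10.0, 2.0]])
    velocities = np.array([[2.0, 0.0], [0.0, 0.0]])  # one moving, one static
    print(compensate_motion(positions, velocities, 0.5, ego_prev, ego_curr))
```

In this example the ego vehicle advances 5 m, so the static object at (10, 2) ends up at (5, 2) in the new ego frame, while the object moving at 2 m/s ends up at (6, 0). A static grid only needs the ego-motion step, whereas per-object representations can additionally apply the object-motion step, which is the distinction the dual-stream design builds on.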
