Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving

Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, Christoph Stiller

TL;DR

This paper tackles negative transfer in end-to-end autonomous driving by decoupling semantic and motion learning. The proposed method, DMAD, combines a Neural-Bayes motion decoder with an interactive semantic decoder; the motion and semantic branches share reference points, but gradients do not flow between them. This lets detection, tracking, and prediction run in parallel, while object and map perception exchange semantic information in both directions. Integrated with UniAD and SparseDrive, DMAD improves perception, prediction, and planning on nuScenes. Key contributions are decoupled motion learning via a Bayes-filter-inspired recursion, bidirectional semantic interaction, and ablations with SHAP-based analysis. The results indicate that dividing and then merging heterogeneous information improves downstream planning performance and safety metrics.

Abstract

Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion-related tasks, such as prediction and planning, impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning. Specifically, we employ a set of learned motion queries that operate in parallel with detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive confirm the effectiveness of our divide-and-merge approach, resulting in performance improvements across perception, prediction, and planning. Our code is available at https://github.com/shenyinzhe/DMAD.

Paper Structure

This paper contains 24 sections, 7 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Comparison of structures. In (a), semantic and motion learning occur sequentially. In (b), the multi-head structure parallelizes tasks with different heads; however, motion and semantic learning remain sequential in detection, tracking, and prediction. In (c), semantic and motion learning are performed in parallel without latent feature sharing or gradient propagation. In contrast, the exchange of information between the object and map perception modules is enhanced.
  • Figure 2: An overview of DMAD. A backbone processes multi-view images into sensor embeddings. Map and object queries are initialized, then interactively attend to the sensor embeddings for map and object perception. Motion queries, mapped one-to-one with object queries, share reference points that are iteratively updated. Finally, motion queries corresponding to detected objects are decoded into future trajectories. The ego motion query ("e") is used for planning. Gray dashed lines indicate operations without gradient flow.
  • Figure 3:
  • Figure 4: Neural-Bayes motion decoding. After each decoding layer, the semantic decoder updates the reference points, which are then shared with the motion decoder. At the end of each frame, positive object query indices are used to select corresponding motion queries and are together propagated to the subsequent frame, with the motion query predictions serving as reference points for the next frame. This process is similar to the measurement, updating, and prediction steps in a Bayes filter. Map queries, ego queries and sensor embeddings are omitted for simplicity.
  • Figure 5: Qualitative comparison between DMAD and UniAD. Each subfigure demonstrates a sample where UniAD encounters collision while DMAD does not.
  • ...and 5 more figures
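
The recursion described in the Figure 4 caption can be illustrated with a small toy, offered here only as a sketch under stated assumptions and not as the authors' implementation. All names (`Track`, `semantic_layer`, `motion_layer`, `step_frame`) and all constants (`gain`, `beta`, `dt`, `score_threshold`) are hypothetical. The learned transformer decoders are replaced by hand-written update rules in the spirit of an alpha-beta filter, and no gradients are involved, so the "no gradient flow" property is only mirrored by the fact that the motion update reads the reference points without ever writing to them.

```python
from dataclasses import dataclass


@dataclass
class Track:
    ref: tuple       # shared reference point (x, y)
    velocity: tuple  # toy stand-in for a motion query
    score: float     # toy stand-in for the object query's confidence


def semantic_layer(ref, measurement, gain=0.5):
    """Toy 'measurement/update' step: pull the reference point toward an observation."""
    return tuple(r + gain * (m - r) for r, m in zip(ref, measurement))


def motion_layer(velocity, old_ref, new_ref, dt, beta=0.5):
    """Toy motion update: reads the shared reference points but never modifies them."""
    return tuple(v + beta * (n - o) / dt for v, o, n in zip(velocity, old_ref, new_ref))


def step_frame(tracks, measurements, dt=0.5, num_layers=2, score_threshold=0.3):
    """Process one frame and return the tracks propagated to the next frame."""
    survivors = []
    for track, z in zip(tracks, measurements):
        ref, velocity = track.ref, track.velocity
        for _ in range(num_layers):
            # The semantic side refines the reference point after each layer ...
            new_ref = semantic_layer(ref, z)
            # ... and the motion side consumes the refined point in the same layer.
            velocity = motion_layer(velocity, ref, new_ref, dt)
            ref = new_ref
        # Only positive (confident) objects are carried over to the next frame.
        if track.score >= score_threshold:
            # 'Prediction' step: the motion output becomes the next reference point.
            predicted = tuple(r + v * dt for r, v in zip(ref, velocity))
            survivors.append(Track(predicted, velocity, track.score))
    return survivors


tracks = [
    Track(ref=(0.0, 0.0), velocity=(2.0, 0.0), score=0.9),
    Track(ref=(5.0, 5.0), velocity=(0.0, 0.0), score=0.1),
]
next_tracks = step_frame(tracks, measurements=[(1.0, 0.0), (5.0, 5.0)])
print(next_tracks)
```

In this toy run the low-confidence second track is dropped, and the first track enters the next frame with its reference point moved to where its motion state predicts it will be, which is the role the caption assigns to the motion query predictions.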