Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, Jian Pu

TL;DR

AdaptiveAD tackles ego-status over-reliance and causal confusion in end-to-end autonomous driving by architecturally decoupling scene-driven reasoning from ego-driven reasoning through a dual-branch design, and then adaptively fusing the two branches' outputs with a scene-aware fusion module. It introduces a path-attention mechanism to improve ego-BEV interaction and two auxiliary tasks (BEV unidirectional distillation and autoregressive online mapping) to preserve multi-task learning and consistency. Empirically, AdaptiveAD achieves state-of-the-art open-loop planning performance on nuScenes, while significantly reducing reliance on ego priors and demonstrating strong scene generalization and robustness to perturbations across additional benchmarks. This architectural paradigm, explicit decoupling of distinct reasoning contexts followed by adaptive fusion, offers a principled route toward safer, more generalizable autonomous driving systems and motivates integration with learned world models for enhanced causal reasoning.

Abstract

Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.
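
To make the fusion idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how the scene-aware fusion module combines the two branches, so this sketch assumes the simplest possible form: a scalar gate computed from a scene feature vector, used to take a convex combination of the scene-driven and ego-driven trajectories. The names `scene_gate` and `fuse_trajectories`, the sigmoid gate, and the linear weights are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x, y) position of one planned waypoint


def scene_gate(scene_feat: Sequence[float],
               weights: Sequence[float],
               bias: float = 0.0) -> float:
    """Hypothetical gate: map a scene feature vector to a value in (0, 1).

    Assumption: a single linear layer followed by a sigmoid. The real
    fusion module may be far richer (e.g. attention over dense features).
    """
    if len(scene_feat) != len(weights):
        raise ValueError("scene_feat and weights must have the same length")
    logit = sum(f * w for f, w in zip(scene_feat, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def fuse_trajectories(scene_traj: Sequence[Waypoint],
                      ego_traj: Sequence[Waypoint],
                      gate: float) -> List[Waypoint]:
    """Blend two candidate trajectories waypoint by waypoint.

    gate = 1.0 keeps only the scene-driven plan; gate = 0.0 keeps only
    the ego-driven plan. This convex combination is an illustrative
    assumption, not the method described in the paper.
    """
    if len(scene_traj) != len(ego_traj):
        raise ValueError("trajectories must have the same number of waypoints")
    return [
        (gate * sx + (1.0 - gate) * ex, gate * sy + (1.0 - gate) * ey)
        for (sx, sy), (ex, ey) in zip(scene_traj, ego_traj)
    ]
```

In this sketch, a gate near 1 means the final plan follows the scene-driven branch, while a gate near 0 means it falls back on the ego-driven branch. The design point it illustrates is that ego status enters only through its own branch and the late fusion step, rather than through the upstream BEV encoder.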

Paper Structure

This paper contains 28 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Ego-status shortcut and our proposed architectural solution. (a) In conventional architectures, ego status is coupled with scene context, creating a shortcut that allows the planning module to rely on kinematic state. (b) Our AdaptiveAD framework uses a dual-branch design to explicitly decouple scene-driven reasoning from ego-status influence. A scene-aware fusion module then adaptively integrates these complementary decision contexts to generate the final trajectory.
  • Figure 2: An overview of the AdaptiveAD framework. Given a sequence of multi-view images, AdaptiveAD first extracts features using a shared backbone. The core of our framework is a dual-branch architecture that explicitly decouples information flow: one branch generates a scene-driven decision without ego-status influence, while a complementary branch produces an ego-driven decision. These distinct decision contexts are then adaptively integrated by a multi-context decision fusion module, which uses dense scene features as priors to generate the final trajectory. The integrity of this process is supported by two auxiliary tasks designed to enhance perceptual quality and enforce causal consistency.
  • Figure 3: Diagram of path attention.
  • Figure 4: Diagram of autoregressive online mapping.
  • Figure 5: Qualitative comparison of scene generalization ability. In this challenging scenario, our AdaptiveAD demonstrates significantly superior perception capabilities compared to VAD, providing more reliable obstacle-avoidance paths.
  • ...and 2 more figures