Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving

Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao

TL;DR

End-to-end driving models excel in open-loop settings but falter in closed-loop deployment due to observation and objective mismatches. The authors propose Model-Based Policy Adaptation (MPA), which combines geometry-consistent counterfactual data generated with 3D Gaussian Splatting (3DGS), a diffusion-based policy adapter that refines the base policy, and a multi-step Q-value model that guides inference-time action selection. At inference time, the adapter samples multiple trajectory proposals and the one with the highest estimated long-horizon utility is selected. Experiments on nuScenes with a photorealistic simulator show gains across in-domain, out-of-domain, and safety-critical scenarios, and ablations confirm that the adapter, the Q-value guidance, and the scale of the counterfactual data each contribute to performance.

Abstract

End-to-end (E2E) autonomous driving models have demonstrated strong performance in open-loop evaluations but often suffer from cascading errors and poor generalization in closed-loop settings. To address this gap, we propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, exposing the agent to scenarios beyond the original dataset. Based on this generated data, MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the one with the highest expected utility. Experiments on the nuScenes benchmark using a photorealistic closed-loop simulator demonstrate that MPA significantly improves performance across in-domain, out-of-domain, and safety-critical scenarios. We further investigate how the scale of counterfactual data and inference-time guidance strategies affect overall effectiveness.
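The inference-time step described in the abstract (sample several candidates, score each, keep the best) can be illustrated with a minimal sketch. This is not the authors' implementation: `propose_candidates` is a hypothetical stand-in for the diffusion adapter that merely adds Gaussian residuals to a base trajectory, and `toy_q` is a made-up utility (forward progress minus a collision penalty) standing in for the learned multi-step Q-value model. Only the selection logic in `select_best` mirrors what the abstract states.

```python
import math
import random
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]


def propose_candidates(base: Trajectory, n: int, noise_std: float,
                       rng: random.Random) -> List[Trajectory]:
    """Hypothetical adapter stand-in: base trajectory plus a random residual."""
    return [
        [(x + rng.gauss(0.0, noise_std), y + rng.gauss(0.0, noise_std))
         for x, y in base]
        for _ in range(n)
    ]


def toy_q(traj: Trajectory, obstacle: Waypoint = (4.0, 0.0),
          safe_dist: float = 1.0) -> float:
    """Made-up utility: progress along x, minus a penalty for passing
    closer than safe_dist to an assumed static obstacle."""
    progress = traj[-1][0] - traj[0][0]
    min_dist = min(math.hypot(x - obstacle[0], y - obstacle[1]) for x, y in traj)
    penalty = 10.0 if min_dist < safe_dist else 0.0
    return progress - penalty


def select_best(candidates: List[Trajectory],
                q_fn: Callable[[Trajectory], float]) -> Tuple[Trajectory, float]:
    """Score every candidate with q_fn and return the highest-scoring one."""
    scores = [q_fn(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    rng = random.Random(0)
    base = [(float(i), 0.0) for i in range(6)]  # drives straight at the obstacle
    candidates = propose_candidates(base, n=16, noise_std=0.8, rng=rng)
    best, score = select_best(candidates, toy_q)
    print(f"selected candidate with Q = {score:.2f}")
```

In this sketch the number of candidates trades compute for the chance that at least one proposal avoids the penalized region; the scoring function is the only component that decides which proposal is executed.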

Paper Structure

This paper contains 48 sections, 12 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Left: Causes of closed-loop performance degradation in End-to-End driving, including observation and objective mismatches. Right: We propose counterfactual data generation to address the observation mismatch, and a model-based policy adaptation framework tackling the objective mismatch.
  • Figure 2: Comparison of average L2 error in the motion prediction under different prediction horizons.
  • Figure 3: Overview of Model-Based Policy Adaptation (MPA). Left: We propose a counterfactual data generation pipeline, where we first generate initial data of $T$-step trajectories rolled out with a pretrained E2E policy and a 3DGS-based world model, and then filter the generated data with reward and distance constraints to improve data realism, resulting in counterfactual (action, state, observation, reward) sequences. Right: We utilize the generated data to train two MPA modules: (i) a diffusion policy adapter predicting residual actions on top of a base E2E agent, and (ii) a value model $Q$ estimating multi-step cumulative rewards under different principles, such as collision and speed.
  • Figure 4: Ego's augmented trajectories after rotating, warping, and random noising operations.
  • Figure 5: Qualitative results in the in-domain and safety-critical scene. The silver car cuts in from the right side, forcing the ego vehicle to yield. Compared to the pretrained VAD, the MPA-adapted policy successfully brakes and yields to the cut-in vehicle under the guidance of the Q-value model.
  • ...and 8 more figures
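
The captions of Figures 3 and 4 mention two data-side operations: perturbing the ego trajectory (rotation, noise) and filtering rollouts with reward and distance constraints. A rough sketch of that idea follows, under loudly stated assumptions: the `Rollout` container, the choice of rotating about the first waypoint, the max-deviation distance measure, and the thresholds `min_reward` and `max_dist` are all hypothetical, since the page does not give the paper's actual definitions. Warping is omitted.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float]


@dataclass
class Rollout:
    trajectory: List[Waypoint]
    reward: float  # assumed: cumulative reward reported by the simulator


def rotate(traj: List[Waypoint], angle_rad: float) -> List[Waypoint]:
    """Rotate a trajectory about its first waypoint (assumed pivot)."""
    x0, y0 = traj[0]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(x0 + c * (x - x0) - s * (y - y0),
             y0 + s * (x - x0) + c * (y - y0)) for x, y in traj]


def add_noise(traj: List[Waypoint], std: float,
              rng: random.Random) -> List[Waypoint]:
    """Add Gaussian noise to every waypoint except the fixed start."""
    noisy = [(x + rng.gauss(0.0, std), y + rng.gauss(0.0, std))
             for x, y in traj[1:]]
    return [traj[0]] + noisy


def max_deviation(traj: List[Waypoint], reference: List[Waypoint]) -> float:
    """Largest pointwise distance to the reference (hypothetical measure)."""
    return max(math.dist(p, q) for p, q in zip(traj, reference))


def filter_rollouts(rollouts: List[Rollout], reference: List[Waypoint],
                    min_reward: float, max_dist: float) -> List[Rollout]:
    """Keep rollouts that satisfy both the reward and distance constraints."""
    return [r for r in rollouts
            if r.reward >= min_reward
            and max_deviation(r.trajectory, reference) <= max_dist]
```

A rollout is kept here only if it passes both constraints, so a trajectory that stays near the recorded one but earns a low reward is discarded just like one that drifts far away.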