AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Cheng-zhong Xu, Jianbing Shen

TL;DR

AD-R1 tackles the safety gap in RL-based end-to-end autonomous driving by exposing and correcting the optimistic bias in world models. It introduces an Impartial World Model, trained with Counterfactual Synthesis to faithfully imagine hazardous outcomes, which serves as an internal critic for offline policy refinement via Group Relative Policy Optimization guided by dense 4D rewards. A new Risk Foreseeing Benchmark quantifies the model's ability to predict failures (G-IoU, f-IoU, DAF), and extensive experiments show substantial safety improvements (e.g., $+1.7\%$ PDMS) without sacrificing performance. The work demonstrates that teaching a model to dream of danger is a practical, scalable path toward truly safe and capable autonomous agents.

Abstract

End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
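
To make the critic-in-the-loop idea concrete, the sketch below shows the group-relative advantage step that Group Relative Policy Optimization is generally built on: rewards for a group of candidate trajectories are standardized within the group, so a candidate whose imagined future ends in a collision receives a negative advantage relative to its safer siblings. This is a minimal illustration, not the authors' implementation. The function name, the `eps` constant, and the example reward values are assumptions made here; in AD-R1 the rewards would come from the Impartial World Model's imagined 4D futures.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidate trajectories.

    Hypothetical helper: each reward is assumed to be a scalar score
    that a critic assigned to one candidate trajectory.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation over the group.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when every candidate scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative (made-up) rewards for four candidate plans; the third
# plan is the one whose imagined outcome is a collision.
rewards = [1.0, 0.8, -1.0, 0.9]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered on the group mean, a policy-gradient update that weights each candidate's log-probability by its advantage pushes probability mass away from the penalized plan and toward the safer ones, without needing a separate learned value function.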

Paper Structure

This paper contains 38 sections, 19 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: (a) Conventional RL Post-training relies on external simulators, suffering from a sim-to-real gap and heuristic rewards. (b) Our Paradigm uses a learned Occupancy World Model as an internal, generative simulator, enabling learned interactions and physically-grounded rewards. (c) The "Optimistic Bias": standard world models, however, trained only on safe data, fail to predict danger. They hallucinate safe futures for unsafe plans, providing dangerously high rewards. (d) Our Impartial World Model, trained with Counterfactual Synthesis Data, learns to imagine failure. It faithfully predicts the hazardous outcome, providing a correct, punitive reward, and enabling the agent to learn safely.
  • Figure 2: Overview of AD-R1. Our framework involves two stages: (a) Impartial World Model Training: We first introduce a Counterfactual Synthesis pipeline that decomposes real scenes and programmatically generates unsafe trajectories to create Synthetic Negative Data. Our Impartial World Model is then trained on a mix of this synthetic failure data and real-world safe data. (b) Reinforcement Learning Post-training: The trained model acts as an internal critic. It takes candidate trajectories from a pre-trained agent, "dreams" of the 4D future outcomes, and our 4D Rewarded Modeling module computes a dense reward based on the imagined world. This provides a strong Policy Loss to refine the agent's safety and robustness through imagined failures.
  • Figure 3: Behavior of an agent with and without AD-R1 refinement. Left: The original agent's plan results in a collision or an off-road event. Right: Our refined agent safely avoids the hazard.
  • Figure 4: Qualitative comparison of world models. Top: The reference synthetic data. Middle: The optimistic hallucination of DOME [gu2024dome]. Bottom: AD-R1's faithful off-road prediction.
  • Figure 5: Examples of synthetic unsafe trajectories for the Counterfactual Data (Off-road). The green dot marks the start of the trajectory and the red dot marks its end.
  • ...and 3 more figures