AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

Tianyi Yan; Tao Tang; Xingtai Gui; Yongkang Li; Jiasen Zhesng; Weiyao Huang; Lingdong Kong; Wencheng Han; Xia Zhou; Xueyang Zhang; Yifei Zhan; Kun Zhan; Cheng-zhong Xu; Jianbing Shen

TL;DR

AD-R1 tackles the safety gap in RL-based end-to-end autonomous driving by exposing and correcting the optimistic bias in world models. It introduces an Impartial World Model, trained with Counterfactual Synthesis to faithfully imagine hazardous outcomes, which serves as an internal critic for offline policy refinement via Group Relative Policy Optimization guided by dense 4D rewards. A new Risk Foreseeing Benchmark quantifies the model's ability to predict failures using the G-IoU, f-IoU, and DAF metrics, and extensive experiments show substantial safety improvements (e.g., $+1.7\%$ PDMS) without sacrificing performance. The work demonstrates that teaching a model to dream of danger is a practical, scalable path toward truly safe and capable autonomous agents.

Abstract

End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
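
As a rough illustration of how a critic's scores for a group of candidate actions could feed Group Relative Policy Optimization, the sketch below normalizes each candidate's reward against the mean and standard deviation of its group. This is a minimal sketch and not the authors' implementation: the function `toy_critic_reward`, the candidate fields (`progress`, `collision`, `off_road`), and the penalty weights are all hypothetical stand-ins for the Impartial World Model's imagined outcomes and the paper's dense 4D rewards.

```python
from statistics import mean, pstdev


def toy_critic_reward(candidate):
    """Hypothetical stand-in for the world-model critic.

    Scores one imagined rollout: progress is rewarded, while imagined
    collisions and off-road events are penalized. The weights are
    illustrative assumptions, not values from the paper.
    """
    reward = candidate["progress"]
    if candidate["collision"]:
        reward -= 10.0
    if candidate["off_road"]:
        reward -= 5.0
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical candidate actions sampled for the same scene.
candidates = [
    {"progress": 1.0, "collision": False, "off_road": False},
    {"progress": 1.5, "collision": True, "off_road": False},
    {"progress": 0.8, "collision": False, "off_road": False},
    {"progress": 1.2, "collision": False, "off_road": True},
]

rewards = [toy_critic_reward(c) for c in candidates]
advantages = group_relative_advantages(rewards)

for candidate, advantage in zip(candidates, advantages):
    print(candidate, round(advantage, 3))
```

In this toy setup the candidate that makes the most progress is also the one that collides, so it receives the lowest advantage in its group. A critic with an optimistic bias would fail to imagine that collision and would rank the same candidate highest.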
