Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, Junchi Yan

TL;DR

Raw2Drive presents the first end-to-end model-based RL framework for autonomous driving. It couples a privileged world-model stream with a raw-sensor stream through a Guidance Mechanism that enforces rollout consistency and transfers supervision. Training proceeds in two stages: the privileged stream is learned first, and the raw-sensor stream is then learned under its guidance. This enables effective RL from raw imagery on CARLA v2 and achieves state-of-the-art results on CARLA Leaderboard v2 and Bench2Drive. The approach improves training efficiency and supports the practical viability of RL for end-to-end driving, while acknowledging limitations related to its reliance on privileged input during training and to real-world deployment. Overall, Raw2Drive offers a blueprint for using privileged information to bootstrap end-to-end model-based driving from raw sensor data.

Abstract

Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem because of its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.
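
The rollout-consistency idea in the abstract can be illustrated with a toy sketch. The abstract does not give the loss used by the Guidance Mechanism, so the mean-squared distance below is an assumption, and `rollout`, `rollout_consistency_loss`, `privileged_step`, and `raw_step` are hypothetical names with made-up dynamics, not the authors' implementation. The sketch unrolls two latent dynamics functions under the same action sequence and measures how far the raw-sensor trajectory drifts from the privileged one.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Dynamics = Callable[[Vector, float], Vector]


def rollout(step: Dynamics, z0: Vector, actions: Sequence[float]) -> List[Vector]:
    """Unroll a latent dynamics function over a fixed action sequence."""
    states: List[Vector] = []
    z = list(z0)
    for a in actions:
        z = step(z, a)
        states.append(z)
    return states


def rollout_consistency_loss(raw_states: List[Vector],
                             priv_states: List[Vector]) -> float:
    """Mean squared distance between two latent trajectories.

    ASSUMPTION: the loss form is illustrative; the source does not state it.
    """
    if len(raw_states) != len(priv_states):
        raise ValueError("trajectories must have the same length")
    total, count = 0.0, 0
    for z_raw, z_priv in zip(raw_states, priv_states):
        for r, p in zip(z_raw, z_priv):
            total += (r - p) ** 2
            count += 1
    return total / count


# Toy, hand-written dynamics standing in for the two world models (hypothetical).
def privileged_step(z: Vector, a: float) -> Vector:
    return [0.9 * z[0] + a, 0.5 * z[1]]


def raw_step(z: Vector, a: float) -> Vector:
    return [0.8 * z[0] + a, 0.5 * z[1]]


actions = [0.0, 0.0]
z0 = [1.0, 1.0]
loss = rollout_consistency_loss(rollout(raw_step, z0, actions),
                                rollout(privileged_step, z0, actions))
```

In this toy setting the loss is zero only when both models produce the same latent trajectory, so minimizing it pulls the raw-sensor model's rollouts toward those of the privileged model.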

Paper Structure

This paper contains 29 sections, 4 equations, 9 figures, 13 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of different training paradigms in end-to-end autonomous driving. (a) Imitation Learning suffers from causal confusion [d2014confusion] and distribution shift [wen2020fighting]. Model-free Reinforcement Learning [toromanoff2020end] faces efficiency problems and fails to converge. (b) Model-based Reinforcement Learning: no such work has been reported for raw sensor input E2E-AD, as the raw data can be noisy and redundant, and Think2Drive [li2024think] assumes that privileged ground-truth data is given, so it cannot be directly applied in real-world AD. (c) In Raw2Drive, we propose the first feasible model-based reinforcement learning paradigm for end-to-end autonomous driving. By leveraging low-dimensional, structured privileged input, our approach guides the learning of a world model from raw sensor data, effectively addressing the issues outlined in (a) and (b).
  • Figure 2: The Overall Pipeline of Raw2Drive. (a) During training, we use privileged input to train the privileged world model and its paired policy. Then, the privileged world model is used to guide the training of the raw sensor stream. (b) During inference, only raw sensor inputs are available, which aligns with real-world autonomous driving. (c) The guidance mechanism consists of two parts: (I) Rollout Guidance, to ensure future modeling consistency; (II) Head Guidance, to ensure that the supervision for the raw sensor policy is accurate and stable.
  • Figure 3: Training of Privileged World Model and Policy. The privileged world model $\text{WM}$ is trained with time-sequenced BEV semantic masks as inputs. The privileged policy $\pi$ is trained through rollouts in the privileged world model. The reward $r_t$ and continuation flag $c_t$ are generated by the heads of the privileged world model with the privileged input $o_t$.
  • Figure 4: Training of Raw Sensor World Model. During training, the spatial-temporal feature of the privileged world model serves as supervision, instead of reconstructing multi-view video, so that learning can focus on decision-related information. RSSM parameters are initialized from the privileged world model.
  • Figure 5: Training of Raw Sensor Policy. The raw sensor policy is trained through RL within the dual-stream world model. The raw model operates under strict deduction, while the reward $r_t$ and continuation flag $c_t$ are derived from the privileged model via pseudo-deduction.
  • ...and 4 more figures
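
The Figure 5 caption describes a policy rollout in which the raw model advances the state while the reward $r_t$ and continuation flag $c_t$ come from the privileged model's heads. A minimal sketch of that wiring follows. It assumes the privileged latent is rolled forward in parallel under the same actions and that the return is discounted in the usual way for imagined rollouts; neither detail is confirmed by the captions, and every function name here is hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dual_stream_imagine(
    raw_step: Callable[[Vector, float], Vector],
    priv_step: Callable[[Vector, float], Vector],
    reward_head: Callable[[Vector], float],
    continue_head: Callable[[Vector], float],
    policy: Callable[[Vector], float],
    z_raw: Vector,
    z_priv: Vector,
    horizon: int,
) -> Tuple[List[float], List[float]]:
    """Imagine `horizon` steps; the policy sees only the raw-stream latent.

    ASSUMPTION: the privileged latent is advanced in parallel and its heads
    supply r_t and c_t. The exact wiring in the paper may differ.
    """
    rewards: List[float] = []
    continues: List[float] = []
    for _ in range(horizon):
        action = policy(z_raw)                    # raw stream drives the action
        z_raw = raw_step(z_raw, action)           # raw world model transition
        z_priv = priv_step(z_priv, action)        # privileged transition, same action
        rewards.append(reward_head(z_priv))       # r_t from the privileged head
        continues.append(continue_head(z_priv))   # c_t from the privileged head
    return rewards, continues


def discounted_return(rewards: List[float], continues: List[float],
                      gamma: float = 0.99) -> float:
    """Sum of rewards, with discounting cut off once a continuation flag is 0."""
    ret, discount = 0.0, 1.0
    for r, c in zip(rewards, continues):
        ret += discount * r
        discount *= gamma * c
    return ret


# Toy stand-ins for the learned components (all hypothetical).
rewards, continues = dual_stream_imagine(
    raw_step=lambda z, a: [z[0] + a],
    priv_step=lambda z, a: [z[0] + a],
    reward_head=lambda z: z[0],
    continue_head=lambda z: 1.0 if z[0] < 3 else 0.0,
    policy=lambda z: 1.0,
    z_raw=[0.0],
    z_priv=[0.0],
    horizon=4,
)
value = discounted_return(rewards, continues, gamma=0.5)  # 2.75
```

Because the continuation flag multiplies the running discount, rewards predicted after a terminal step contribute nothing to the return, which is why the fourth reward is ignored in the toy run.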