From Imitation to Exploration: End-to-end Autonomous Driving based on World Model

Yueyuan Li; Mingyang Jiang; Songan Zhang; Wei Yuan; Chunxiang Wang; Ming Yang

TL;DR

RAMBLE addresses the generalization gaps of end-to-end autonomous driving by combining imitation learning with a world-model-based reinforcement learning framework. It uses an asymmetric VAE (the V model) for multi-modal perception, a Transformer-based M model for dynamics, and an SAC-based C model for control, all trained with a staged IL-to-RL curriculum and guided by a differentiable action mask. The approach achieves state-of-the-art route completion on CARLA Leaderboard 1.0 and completes all 38 interactive scenarios on Leaderboard 2.0, demonstrating robust performance in diverse weather and traffic conditions. The work highlights the value of combining imitation with exploration for efficient and safe driving policy learning, and the authors plan to open-source RAMBLE upon paper acceptance to support future research.

Abstract

In recent years, end-to-end autonomous driving architectures have gained increasing attention because they avoid the error accumulation of modular pipelines. Most existing end-to-end autonomous driving methods are based on Imitation Learning (IL), which can quickly derive driving strategies by mimicking expert behaviors. However, IL often struggles to handle scenarios outside the training dataset, especially in highly dynamic, interaction-intensive traffic environments. In contrast, Reinforcement Learning (RL)-based driving models can optimize driving decisions through interaction with the environment, improving adaptability and robustness. To leverage the strengths of both IL and RL, we propose RAMBLE, an end-to-end world model-based RL method for driving decision-making. RAMBLE extracts environmental context information from RGB images and LiDAR data through an asymmetrical variational autoencoder. A transformer-based architecture is then used to capture the dynamic transitions of traffic participants. Next, an actor-critic reinforcement learning algorithm is applied to derive driving strategies based on the latent features of the current state and dynamics. To accelerate policy convergence and ensure stable training, we introduce a training scheme that initializes the policy network using IL and employs a KL loss and soft-update mechanisms to smoothly transition the model from IL to RL. RAMBLE achieves state-of-the-art performance in route completion rate on the CARLA Leaderboard 1.0 and completes all 38 scenarios on the CARLA Leaderboard 2.0, demonstrating its effectiveness in handling complex and dynamic traffic scenarios. The model will be open-sourced upon paper acceptance at https://github.com/SCP-CN-001/ramble to support further research and development in autonomous driving.
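
The abstract names two ingredients of the IL-to-RL transition, a KL loss and soft updates, without giving their exact form. The sketch below shows one common way each could look, assuming diagonal Gaussian policies and Polyak-style parameter averaging. The function names, the two-dimensional action, the numeric values, and `tau` are all hypothetical illustrations and are not taken from the paper.

```python
import math

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians given per-dimension means and std devs."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total

def soft_update(target, source, tau):
    """Polyak averaging: move each target parameter a fraction tau toward the source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]

# Hypothetical policy outputs for a 2-D action (assumed here to be throttle and steer).
il_mu, il_std = [0.2, -0.1], [0.3, 0.3]   # IL-pretrained policy (assumed frozen reference)
rl_mu, rl_std = [0.5, 0.0], [0.4, 0.2]    # policy currently being updated by RL

# A KL penalty like this could be added to the RL loss to keep the policy near the IL prior.
kl_penalty = gaussian_kl(rl_mu, rl_std, il_mu, il_std)

# Soft update of a (hypothetical) flat parameter vector with an assumed tau of 0.1.
target_params = soft_update([0.0, 1.0], [1.0, 0.0], tau=0.1)
```

Under these assumptions, the KL term discourages the RL policy from drifting abruptly away from the imitation prior, while a small `tau` makes the target parameters change gradually; how RAMBLE actually weights or schedules these terms is not stated in the abstract.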

Paper Structure (32 sections, 20 equations, 3 figures, 3 tables)

Introduction
Related Works
End-to-end Driving with Deep Learning
World Model in Autonomous Driving
Method
Overview
V Model
M Model
C Model
RL Agent
Step Reward
Reward for speed $r_\textrm{speed}$
Reward for traveled distance $r_\textrm{distance}$
Penalty for route deviation angle $r_\textrm{dev angle}$
Penalty for route distance deviation $r_\textrm{dev distance}$
...and 17 more sections

Figures (3)

Figure 1: The overall structure of RAMBLE. The V model compresses multi-view RGB images, LiDAR point clouds, and route points to the latent feature $z_t$, which describes the current state. The M model takes the latent features $z_{t-n:t}$ and actions $a_{t-n:t}$ to predict the environmental dynamics by the latent feature $h_t$. The C model generates action commands based on $z_t$ and $h_t$.
Figure 2: A visualization of the V model's state latent features and the M model's estimated state latent feature.
Figure 4: The C model's performance when it is pretrained under the IL paradigm and then transitioned to the RL paradigm. The line plots are time-weighted EMAs of the data points with a window size of 50. The shaded areas depict the raw values of the route completion rate and the step reward.
