Offline Reinforcement Learning for End-to-End Autonomous Driving

Chihiro Noguchi, Takaki Yamamoto

TL;DR

End-to-end autonomous driving models trained purely from camera input inherit the shortcomings of imitation learning, including covariate shift and causal confusion. The authors propose a camera-only offline RL framework with a discrete-action actor-critic model, regularized toward pseudo-expert actions derived from interpolated expert trajectories. The model is trained entirely on a fixed simulator dataset and evaluated in closed loop in a neural-rendering environment built on nuScenes. They report lower collision rates and higher route completion than IL baselines, with ablations highlighting the importance of pseudo-expert regularization, reward shaping, and careful behavior-policy composition. The work offers a data-efficient path for improving E2E driving in safety-critical contexts and underscores how strongly dataset design and regularization affect offline RL for autonomous systems.

Abstract

End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code will be available at [URL].
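
The following is a minimal sketch of how a behavior regularization term could enter a discrete-action actor objective. It assumes a generic form: the negative expected Q-value under the policy plus a weighted negative log-likelihood of the pseudo-expert action. The abstract does not state the paper's actual loss, so the function name `regularized_actor_loss`, the weight `reg_weight`, and the way the two terms are combined are illustrative assumptions, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_actor_loss(logits, q_values, pseudo_expert_idx, reg_weight=1.0):
    """Hypothetical actor loss for a single state with discrete actions.

    logits            -- actor outputs, one per discrete action
    q_values          -- critic estimates Q(s, a), one per discrete action
    pseudo_expert_idx -- index of the pseudo-expert action (assumed given)
    reg_weight        -- assumed trade-off weight for the regularizer
    """
    probs = softmax(logits)
    # Policy-improvement term: maximize the expected Q-value under the policy.
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    # Regularization term: negative log-likelihood of the pseudo-expert action,
    # which pulls the policy toward actions that are supported by the data.
    nll = -math.log(probs[pseudo_expert_idx])
    return -expected_q + reg_weight * nll


# Uniform policy over three actions, pseudo-expert action at index 2.
loss = regularized_actor_loss([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], pseudo_expert_idx=2)
```

In this sketch, raising `reg_weight` keeps the policy closer to the pseudo-expert action and so limits how much it can exploit overestimated Q-values on OOD actions, while `reg_weight=0` leaves only the value-maximization term.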

Paper Structure

This paper contains 22 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Driving policy learning paradigms. (a) Imitation Learning (IL): Supervised learning on a fixed expert dataset. (b) Online Reinforcement Learning (RL): Policy learns via continuous live interaction with a simulator. (c) Offline Reinforcement Learning (Our Approach): Policy learns from a fixed, pre-collected dataset without new simulator interaction.
  • Figure 2: An overview of the proposed camera-only offline RL framework. (a) Data Collection: An offline dataset is generated by executing various behavior policies in a neural rendering simulator to collect rollout data. (b) Policy Network: A discrete action-space Actor-Critic network, built upon an Encoder and BEV Decoder, is trained using the fixed offline dataset. (c) Pseudo-expert Action: A pseudo-expert action is generated from the offline dataset by interpolating expert ground-truth trajectories. (d) Closed-loop Evaluation: The trained policy is evaluated in the simulator on two distinct suites: General Driving Scenarios and Safety-critical Scenarios.
  • Figure 3: Influence of behavior policy on learned strategy. Model IDs (e.g., M4) are consistent across both subfigures. (a) Radar chart comparing policy 'personalities' across metrics. All values are normalized with min and max values among all models. (b) Scatter plot showing the trade-off between general driving efficiency ($\text{RC}_\text{Gen}$) and safety-critical non-collision rate ($1-\text{CR}_\text{Safe}$).
  • Figure 4: Qualitative results in two safety-critical NeuroNCAP scenarios. We compare the trajectories of our offline RL model (VADv2*, trained on VAD($\sigma=0.2$) + VAD($\sigma=0.4$) dataset) against a baseline IL model and an RL ablation (VADv2†, trained on VAD($\sigma=0.2$) + Random dataset). The number assigned to each waypoint indicates the frame index. (a) A scooter cuts into the ego-vehicle's path. (b) A stationary vehicle obstructs the lane.
  • Figure 5: Scatter plot showing the trade-off between route completion in general driving scenarios and non-collision rate in safety-critical scenarios.
  • ...and 3 more figures
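
The Figure 3(a) caption states that all radar-chart values are normalized with the min and max values among all models. A minimal sketch of that per-metric min-max normalization is given below. The model IDs, metric names, and scores are hypothetical placeholders rather than results from the paper, and mapping a metric with zero spread to 0.0 is an assumption of this sketch.

```python
def min_max_normalize(scores):
    """Rescale each metric to [0, 1] using its min and max across all models.

    scores -- dict mapping model ID -> dict mapping metric name -> raw value
    """
    metrics = list(next(iter(scores.values())).keys())
    normalized = {model: {} for model in scores}
    for metric in metrics:
        values = [scores[model][metric] for model in scores]
        lo, hi = min(values), max(values)
        span = hi - lo
        for model in scores:
            # Assumption: a metric with identical values for all models maps to 0.0.
            normalized[model][metric] = (
                0.0 if span == 0 else (scores[model][metric] - lo) / span
            )
    return normalized


# Hypothetical raw scores for three models (not taken from the paper).
raw = {
    "M1": {"route_completion": 50.0, "non_collision": 0.25},
    "M2": {"route_completion": 75.0, "non_collision": 0.75},
    "M3": {"route_completion": 100.0, "non_collision": 0.50},
}
normalized = min_max_normalize(raw)
```

Because each axis is rescaled independently, the chart shows how the models rank against each other on every metric, but the absolute gaps between models are not comparable across axes.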