PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning

Yinfeng Gao, Qichao Zhang, Deqing Liu, Zhongpu Xia, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Long Chen, Da-Wei Ding, Dongbin Zhao

Abstract

End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, rendering-based training environments introduce a rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.

Paper Structure

This paper contains 30 sections, 10 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Different training paradigms of E2E autonomous driving.
  • Figure 2: The framework of PerlAD. (a) The RL training loop. An offline dataset initializes motion states for the pseudo-simulation and provides sensor observations for the E2E model. The E2E model generates reactive agent predictions and planning actions, which are then provided to the simulation to compute rewards. (b) The pseudo-simulation environment. It is responsible for simulating future scenarios and calculating rewards. (c) The structure of the E2E model. The sparse perception extracts structured representations, which are then processed by unified transformer blocks for feature interaction. This is followed by a decoupled planner that outputs lateral and longitudinal actions, and a prediction world model that generates reactive agent trajectories.
  • Figure 3: PerlAD adopts a hierarchical decoupled planning scheme: the lateral planner outputs multi-modal future paths, while the longitudinal planner generates target speeds conditioned on the selected path.
  • Figure 4: The Prediction World Model autoregressively generates multi-modal trajectory predictions for surrounding agents, explicitly conditioned on the ego's future trajectory. The highest-probability modality is selected for subsequent simulation and reward computation.
  • Figure 5: Ablation on the proportion of reactive scenarios and the cumulative number of reactive scenarios. Driving Score on Dev10 is reported.
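
The captions for Figures 2 and 3 describe a planner that fixes a lateral path and then chooses a longitudinal target speed, with rewards computed in a rendering-free, vector-space simulation. The following is a minimal toy sketch of that idea, not the authors' implementation. Every name in it (`rollout_along_path`, `toy_reward`, `best_target_speed`), the reward shape, the thresholds, and the time step are assumptions made for illustration. The agent trajectory is a fixed stand-in for what the prediction world model would produce, and an exhaustive search over a few candidate speeds is swapped in for RL policy optimization.

```python
import math

def rollout_along_path(path, speed, dt=0.5, steps=6):
    """Place the ego on the polyline `path` at constant `speed` (m/s).

    Returns steps + 1 (x, y) points, starting at the first path vertex.
    """
    # Cumulative arc length at each path vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    traj = []
    for k in range(steps + 1):
        s = min(speed * dt * k, cum[-1])  # clamp to the end of the path
        i = 0
        while i < len(cum) - 2 and cum[i + 1] < s:
            i += 1  # advance to the segment that contains arc length s
        seg = cum[i + 1] - cum[i]
        t = 0.0 if seg == 0.0 else (s - cum[i]) / seg
        (x0, y0), (x1, y1) = path[i], path[i + 1]
        traj.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return traj

def toy_reward(ego_traj, agent_traj, safe_dist=2.0, collision_penalty=100.0):
    """Hypothetical reward: displacement minus a penalty for getting too close."""
    # Straight-line displacement as a crude progress proxy.
    progress = math.hypot(ego_traj[-1][0] - ego_traj[0][0],
                          ego_traj[-1][1] - ego_traj[0][1])
    # Smallest ego-agent gap over time-aligned steps.
    min_gap = min(math.hypot(ex - ax, ey - ay)
                  for (ex, ey), (ax, ay) in zip(ego_traj, agent_traj))
    return progress - (collision_penalty if min_gap < safe_dist else 0.0)

def best_target_speed(path, agent_traj, candidate_speeds):
    """Score each candidate speed on the fixed path and keep the best one."""
    return max(candidate_speeds,
               key=lambda v: toy_reward(rollout_along_path(path, v), agent_traj))

# Straight 30 m path; an agent crosses it at x = 12 m, reaching the path at t = 3 s.
path = [(0.0, 0.0), (30.0, 0.0)]
agent_traj = [(12.0, 3.0 - 0.5 * k) for k in range(7)]
best = best_target_speed(path, agent_traj, [0.0, 2.0, 4.0, 6.0])
```

In this toy scene the faster candidates reach the crossing point while the agent is within the assumed safety distance, so the search settles on a slower speed that still makes progress. A learned longitudinal policy would replace the search, and a reactive prediction model would replace the fixed `agent_traj`.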