Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

Zhexi Lian, Haoran Wang, Xuerun Yan, Weimeng Lin, Xianhong Zhang, Yongyu Chen, Jia Hu

Abstract

End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, this paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often hits a performance ceiling because it depends on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond the prior knowledge of IL. Furthermore, we introduce a tree-structured trajectory neural sampler to group relative policy optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on the NAVSIMv1 and v2 benchmarks demonstrates that PaIR-Drive achieves competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods and can even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.
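
The group-relative idea behind GRPO, mentioned above, can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each sampled trajectory's reward is normalized by the mean and standard deviation of its own sampling group; the abstract does not state the exact form used in PaIR-Drive, and the function name, `eps`, and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical simulated rewards for four trajectories sampled in one scene.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Trajectories above the group mean get a positive advantage, the rest negative.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Because the baseline is the group mean rather than a learned value function, a trajectory is reinforced only when it beats the other samples drawn for the same scene.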

Paper Structure (13 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Examples of poor human driving behavior in the real-world dataset NAVSIM. (a) Singapore: the human driver travels in the wrong direction in the opposing lane; (b) Las Vegas: the human driver violates a traffic light and turns left.
  • Figure 2: Comparison of existing training schemes and ours for end-to-end autonomous driving. (a) One-shot IL$\to$RL: IL-based training with subsequent RL fine-tuning; (b) Iterative IL$\leftrightarrow$RL: alternating between IL training and RL fine-tuning; (c) Our parallel framework for collaborative IL and RL.
  • Figure 3: Illustration of the training process in the parallel scheme of PaIR-Drive. The IL branch follows a typical end-to-end planning design and is supervised by the human trajectory. Simultaneously, the RL branch builds upon human trajectories and aims to explore better ones. In the RL branch, a tree-structured trajectory neural sampler recurrently predicts the trajectory point offsets of driving intentions unseen in human demonstrations. Finally, the trajectories and their simulated rewards are used by GRPO to update the policy.
  • Figure 4: Illustration of the inference process in the parallel scheme of PaIR-Drive. Compared with Figure 3, we replace the human trajectory in the RL branch with the trajectory generated by the IL branch, and employ an additional trained reward world model to evaluate candidate trajectories and select the final plan.
  • Figure 5: Illustration of the tree-structured trajectory neural sampler with the capability of generating trajectories under different driving intentions unseen in human demonstrations.
  • ...and 2 more figures
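
The captions for Figures 4 and 5 describe two steps that a toy sketch can make concrete: branching a base trajectory into candidates through accumulated point offsets, then scoring the candidates and keeping the best one. The sketch below is only an illustration under loud assumptions. The paper's sampler is a learned network and its scorer is a trained reward world model, whereas here the offsets are fixed constants and the reward is a hand-written function. All names (`expand_tree`, `select_plan`, `toy_reward`) are hypothetical.

```python
from itertools import product


def expand_tree(base_traj, offset_options, depth):
    """Enumerate candidates by branching over lateral offsets.

    Each of the first `depth` waypoints is a branching point; the chosen
    offsets accumulate along the branch. Fixed offsets stand in for the
    network-predicted offsets (assumption).
    """
    candidates = []
    for choice in product(offset_options, repeat=depth):
        traj, shift = [], 0.0
        for i, (x, y) in enumerate(base_traj):
            if i < depth:
                shift += choice[i]
            traj.append((x, y + shift))
        candidates.append(traj)
    return candidates


def select_plan(candidates, reward_fn):
    """Return the candidate with the highest predicted reward."""
    return max(candidates, key=reward_fn)


def toy_reward(traj):
    """Hypothetical reward: prefer ending near lateral position y = 1.0."""
    return -abs(traj[-1][1] - 1.0)


# A straight trajectory standing in for the IL branch output at inference.
il_traj = [(float(t), 0.0) for t in range(1, 5)]
candidates = expand_tree(il_traj, offset_options=(-0.5, 0.0, 0.5), depth=2)
best = select_plan(candidates, toy_reward)
print(len(candidates), best)
```

In this sketch the all-zero branch reproduces the IL trajectory, so the selected plan is never scored lower than the IL output under the same reward function. Whether PaIR-Drive keeps the IL trajectory in its candidate set is not stated in the captions.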