CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving

Dongkun Zhang, Jiaming Liang, Ke Guo, Sha Lu, Qi Wang, Rong Xiong, Zhenwei Miao, Yue Wang

TL;DR

CarPlanner addresses RL-based trajectory planning for autonomous driving at scale with a consistent auto-regressive framework: multi-modal ego trajectories are generated step by step, each conditioned on mode information that stays fixed across time steps. The architecture follows a generation-selection paradigm with four parts (a non-reactive transition model, a mode selector, a trajectory generator, and a rule-augmented selector), supported by an invariant-view module and an expert-guided universal reward. On nuPlan, RL-based CarPlanner reports state-of-the-art closed-loop performance, outperforming RL-, IL-, and rule-based SOTAs in non-reactive settings and matching them in reactive scenarios under certain conditions. The work also offers practical guidance on model-based RL design for planning, clarifying the trade-offs between RL and IL and highlighting the benefits of consistency and mode-conditioned generation for large-scale driving tasks.

Abstract

Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiency and with handling large-scale, real-world driving scenarios. In this paper, we introduce CarPlanner, a Consistent auto-regressive Planner that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by keeping the trajectory temporally coherent across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that an RL-based planner can surpass both IL- and rule-based state-of-the-art (SOTA) methods on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches on this demanding dataset.
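To make the idea of consistency concrete, the following is a minimal toy sketch, not the paper's model. It assumes a hypothetical set of three lateral "modes" and a hand-written step function in place of a learned policy. The "vanilla" rollout redraws its intent at every step purely to illustrate what inconsistency looks like; it is not the baseline evaluated in the paper. The consistent rollout conditions every step on the same mode, so each mode yields one coherent trajectory.

```python
import random

# Hypothetical lateral targets (metres) standing in for driving modes.
# These names and values are illustrative assumptions, not from the paper.
MODES = {"left": 3.5, "keep": 0.0, "right": -3.5}


def step(pose, mode):
    """Toy policy step: advance 1 m and move halfway to the mode's lateral target."""
    x, y = pose
    return (x + 1.0, y + 0.5 * (MODES[mode] - y))


def vanilla_rollout(start, horizon, rng):
    """Each step draws its own mode, so the intent can flip mid-trajectory."""
    pose, traj = start, [start]
    for _ in range(horizon):
        pose = step(pose, rng.choice(list(MODES)))
        traj.append(pose)
    return traj


def consistent_rollout(start, horizon, mode):
    """The same mode conditions every step of the auto-regressive rollout."""
    pose, traj = start, [start]
    for _ in range(horizon):
        pose = step(pose, mode)
        traj.append(pose)
    return traj


def multimodal_rollouts(start, horizon):
    """One mode-aligned trajectory per mode, as in a multi-modal generator."""
    return {mode: consistent_rollout(start, horizon, mode) for mode in MODES}


if __name__ == "__main__":
    for mode, traj in multimodal_rollouts((0.0, 0.0), 8).items():
        print(mode, [round(y, 2) for _, y in traj])
    print("vanilla", [round(y, 2) for _, y in vanilla_rollout((0.0, 0.0), 8, random.Random(0))])
```

In this toy setting each consistent rollout converges smoothly toward its own lateral target, while the vanilla rollout can wander between targets because nothing ties one step's intent to the next.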

Paper Structure

This paper contains 26 sections, 11 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Frameworks for multi-step trajectory generation. (a) Initialization-refinement that generates an initial trajectory and refines it iteratively. (b) Vanilla auto-regressive models that decode subsequent poses sequentially. (c) Our consistent auto-regressive model that integrates time-consistent mode information.
  • Figure 2: CarPlanner contains four parts. (1) The non-reactive transition model takes the initial state $\boldsymbol{s}_0$ as input and predicts the future trajectories of traffic agents. (2) The mode selector outputs scores based on the initial state and the modes $\boldsymbol{c}$. (3) The trajectory generator follows an auto-regressive structure conditioned on the consistent mode and produces mode-aligned multi-modal trajectories. (4) The rule-augmented selector adjusts the mode scores using safety, comfort, and progress metrics.
  • Figure 3: Qualitative comparison of PDM-Closed and our method in non-reactive environments. The scenario is annotated as waiting_for_pedestrian_to_cross. In each frame, the ego vehicle is shown in green and traffic agents are shown in sky blue. The blue line is the ego planned trajectory.
  • Figure 4: Performance of different training time horizons under different testing time horizons. The value in each cell is the CLS-NR metric on the Test14-Random non-reactive benchmark.
  • Figure 5: Qualitative comparison of using reactive and non-reactive transition models in non-reactive environments. The scenario is annotated as waiting_for_pedestrian_to_cross. In each frame, the ego vehicle is shown in green and traffic agents are shown in sky blue. The blue line is the ego planned trajectory.
  • ...and 1 more figure
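The selection stage described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the Candidate fields, the additive combination, and the weights are all hypothetical, since the caption only says that the mode scores are adjusted by safety, comfort, and progress metrics and does not give a formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated trajectory summarized by its scores (all fields hypothetical)."""
    mode: str
    mode_score: float  # score from a learned mode selector
    safety: float      # in [0, 1]; 1 means no predicted conflict
    comfort: float     # in [0, 1]; 1 means smooth motion
    progress: float    # in [0, 1]; 1 means full progress along the route


def rule_augmented_score(c, w_safety=1.0, w_comfort=0.2, w_progress=0.5):
    """Assumed additive combination of the learned score and rule-based metrics."""
    return (c.mode_score
            + w_safety * c.safety
            + w_comfort * c.comfort
            + w_progress * c.progress)


def select(candidates):
    """Pick the candidate with the highest rule-augmented score."""
    return max(candidates, key=rule_augmented_score)


if __name__ == "__main__":
    candidates = [
        Candidate("proceed", mode_score=0.9, safety=0.0, comfort=0.8, progress=0.9),
        Candidate("yield", mode_score=0.6, safety=1.0, comfort=0.9, progress=0.5),
    ]
    print(select(candidates).mode)
```

With these illustrative weights, a candidate that the mode selector prefers but that fails the safety check can lose to a safer one, which is the intent of adding rule-based terms on top of the learned scores.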