
$AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning

Yuqi Ye, Zijian Zhang, Junhong Lin, Shangkun Sun, Changhao Peng, Wei Gao

Abstract

Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing the crucial perception and prediction stages, which creates a significant domain gap and compromises decision-making capability; 2) other VLMs can generate outputs for perception, prediction, and planning, but these modules operate separately in a fragmented decision-making process whose lack of synergy undermines planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\textbf{P}$erception, $\textbf{P}$rediction, and $\textbf{P}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning: perception supplies essential information to the subsequent prediction and planning stages, and perception and prediction jointly inform the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at https://github.com/haha-yuki-haha/AutoDrive-P3.
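The abstract does not specify the output format of the three-stage reasoning; as a minimal illustrative sketch (all class and field names below are hypothetical assumptions, not the paper's actual schema), the Perception-Prediction-Planning structure might be modeled as:

```python
# A minimal illustrative sketch of a three-stage P^3-CoT response.
# All class and field names are hypothetical, not the paper's schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PerceptionStep:
    reasoning: str             # CoT rationale grounded in the video input
    key_objects: List[str]     # e.g., ["pedestrian at crosswalk", "braking lead vehicle"]

@dataclass
class PredictionStep:
    reasoning: str             # CoT conditioned on the perception output
    future_motions: List[str]  # e.g., ["pedestrian will cross", "lead vehicle will stop"]

@dataclass
class PlanningStep:
    reasoning: str                         # CoT conditioned on perception and prediction
    trajectory: List[Tuple[float, float]]  # planned ego waypoints (x, y)

@dataclass
class P3Response:
    perception: PerceptionStep   # stage 1 feeds stages 2 and 3
    prediction: PredictionStep   # stage 2 feeds stage 3
    planning: PlanningStep       # final decision informed by stages 1 and 2
```

The point of such a structure is the dependency chain: each later stage conditions on the earlier stages' outputs rather than on the raw input alone.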

Figures (17)

  • Figure 1: The difference between ${AutoDrive\text{-}P^3}$ and other paradigms. Our method combines an end-to-end training framework with three-stage collaborative supervision in a VLM.
  • Figure 2: Overview of ${AutoDrive\text{-}P^3}$. It processes video and ego vehicle data through structured Perception-Prediction-Planning Chain-of-Thought ($P^3\text{-}CoT$) reasoning, generating interpretable step-by-step rationale and structured outputs for perception, prediction, and planning.
  • Figure 3: The pipeline for constructing the ${P^3\text{-}CoT}$ dataset. We first sample data and annotations from existing datasets, then construct sample labels, focusing on key objects and applying rule-based and manual filtering. Finally, with the help of an advanced VLM, we construct the CoT, emphasizing the connections among the three stages: perception, prediction, and planning.
  • Figure 4: The pipeline of ${P^3\text{-}GRPO}$. We first cold-start the base model on $P^3\text{-}CoT$ to bridge the gap between general-purpose VLMs and autonomous driving and to teach the CoT answer format. We then use GRPO to find the best optimization path and update the model (see the sketch after this list).
  • Figure 5: Dual thinking modes and running time on nuScenes Benchmark.
  • ...and 12 more figures
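Figure 4 outlines the cold-start-then-GRPO training pipeline. As a point of reference, standard GRPO samples a group of rollouts per prompt and normalizes each rollout's reward against the group statistics. The sketch below shows that group-relative advantage, plus a hypothetical per-stage reward aggregation; the `staged_reward` weights and stage rewards are illustrative assumptions, not the paper's $P^3\text{-}GRPO$ design:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages for one prompt: normalize each sampled
    response's reward by the group mean and std (standard GRPO)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def staged_reward(r_perc: float, r_pred: float, r_plan: float,
                  w=(0.2, 0.3, 0.5)) -> float:
    """Hypothetical aggregation of per-stage rewards; weights are
    illustrative assumptions, not the paper's values."""
    return w[0] * r_perc + w[1] * r_pred + w[2] * r_plan

# Example: 4 rollouts sampled for the same driving scene.
rewards = np.array([staged_reward(0.9, 0.7, 0.8),
                    staged_reward(0.6, 0.5, 0.4),
                    staged_reward(0.8, 0.8, 0.9),
                    staged_reward(0.3, 0.2, 0.1)])
print(grpo_advantages(rewards))  # positive for above-average rollouts
```

Because advantages are computed relative to the group, rollouts that score the perception, prediction, and planning stages well are reinforced jointly rather than each stage being optimized in isolation.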