AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, Long Chen, Bing Wang, Zhi-xin Yang

TL;DR

AdaThinkDrive advances autonomous driving by enabling adaptive thinking, switching between fast direct predictions and slow CoT reasoning based on scene complexity. It combines three-stage training (large-scale driving QA pretraining, dual-mode SFT with Think/Non-Think outputs, and GRPO-based reinforcement learning) with a four-component Adaptive Think Reward to learn when to reason. Empirical results on NAVSIM show state-of-the-art PDMS among vision-only methods, with notable gains from adaptive reasoning and reduced inference time relative to full-think baselines. The work demonstrates that selectively applying CoT can achieve superior planning accuracy and efficiency in diverse driving scenarios, supported by extensive ablations.

Abstract

Reasoning techniques such as Chain of Thought (CoT) have been widely adopted in Vision-Language-Action (VLA) models and show promising capabilities in end-to-end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual-mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large-scale autonomous driving (AD) scenarios using both question-answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine-tuning (SFT), we introduce a two-mode dataset, fast answering (without CoT) and slow thinking (with CoT), enabling the model to distinguish scenarios that require reasoning from those that do not. Furthermore, we propose an Adaptive Think Reward strategy, used in conjunction with Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across the two reasoning modes. Extensive experiments on the NAVSIM benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision-only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never-Think and always-Think baselines, improving PDMS by 2.0 and 1.4 points, respectively. It also reduces inference time by 14% compared to the always-Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.
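The abstract names GRPO as the reinforcement learning algorithm but does not spell out how it scores responses. As a minimal sketch of the group-relative idea only, the snippet below normalizes the rewards of several responses sampled for the same scene against the group mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group (GRPO-style).

    `rewards` holds one scalar reward per response sampled for the same
    input. `eps` is a hypothetical stabilizer that avoids division by
    zero when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses for one scene, scored by a combined reward.
advantages = group_relative_advantages([0.9, 0.7, 0.8, 0.6])
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below it are discouraged. No separate value network is needed, because the group itself serves as the baseline.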

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Impact and Design of Adaptive Reasoning in Trajectory Prediction.
  • Figure 2: We present AdaThinkDrive, an end-to-end autonomous driving framework that adaptively selects between "Thinking" and "Non-Thinking" modes depending on scene complexity. Given vision and text inputs, the VLM dynamically determines its output mode through an adaptive reasoning mechanism. During the reinforcement learning stage of the three-stage training process, multiple rewards, including PDMS, format, and endpoint rewards, are combined with the proposed Adaptive Think Reward.
  • Figure 3: Visualization of dynamic agents for Think-style CoT supervision. (a)-(b) depict CIPO-1 agents (occupying the ego lane) and CIPO-2 agents (likely to merge), while (c)-(d) show Motion Interaction cases where agents’ future trajectories intersect with the ego vehicle’s trajectory.
  • Figure 4: Adaptive Think Reward: A Dynamic Reasoning Control Strategy. This reward adjusts the model’s reasoning behavior by identifying misclassified scenes. When scene-specific conditions are satisfied, it assigns rewards to either Thinking or Non-Thinking responses accordingly.
  • Figure 5: The ratio of Think vs. Non-Think choices by AdaThinkDrive across different NAVSIM Test dataset levels. Scene complexity increases progressively from Level 1 (simple) to Level 3 (challenging).
  • ...and 1 more figure
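The Figure 4 caption describes a reward that detects misclassified scenes and then rewards either Thinking or Non-Thinking responses. The sketch below is one possible reading of that description, not the paper's formula: within a group of sampled responses, it compares the mean trajectory score of the two modes, flips the scene's prior label if the other mode clearly does better, and rewards responses that use the preferred mode. The `Rollout` class, the `scene_needs_think` label, the `margin` threshold, and the 0/1 reward values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    thinks: bool   # True if the response includes a CoT segment
    score: float   # trajectory quality for this response (e.g. a PDMS-like score)


def adaptive_think_reward(rollouts, scene_needs_think, margin=0.0):
    """Hypothetical mode reward: 1.0 for responses using the preferred mode.

    `scene_needs_think` is an assumed prior label for the scene. It is
    treated as misclassified, and flipped, when the other mode's mean
    score beats the labeled mode's mean score by more than `margin`.
    """
    think = [r.score for r in rollouts if r.thinks]
    fast = [r.score for r in rollouts if not r.thinks]

    preferred = scene_needs_think
    if think and fast:  # a comparison needs both modes in the group
        gap = mean(think) - mean(fast)
        if scene_needs_think and gap < -margin:
            preferred = False   # labeled complex, but fast answers do better
        elif not scene_needs_think and gap > margin:
            preferred = True    # labeled simple, but reasoning does better

    return [1.0 if r.thinks == preferred else 0.0 for r in rollouts]


# A scene labeled simple in which the Thinking responses score higher.
group = [Rollout(True, 0.9), Rollout(True, 0.8),
         Rollout(False, 0.5), Rollout(False, 0.6)]
mode_rewards = adaptive_think_reward(group, scene_needs_think=False)
```

In this reading, the mode reward would be added to the other reward terms (PDMS, format, endpoint) before the group-relative normalization; how the paper actually weights and combines them is not stated in this excerpt.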