DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking

Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao

TL;DR

DriveAgent-R1 introduces an active-perception autonomous driving agent that grounds high-level planning in on-demand visual evidence obtained through a Vision Toolkit. It deploys a hybrid-thinking framework that adaptively switches between efficient text-based reasoning and robust tool-augmented visual reasoning, trained through a three-stage progressive curriculum (SFT, FCM-RL, AMS-RL) and guided by a domain-aligned visual prior. Empirical results on Drive-Internal and nuScenes show the 3B-parameter DriveAgent-R1 achieving performance competitive with GPT-5 and human drivers, with clear gains from adaptive mode selection and tool use, while improving interpretability and safety grounding. The work also provides extensive ablations and analyses on domain alignment, progressive training, and active versus passive perception, demonstrating both the benefits and limitations of vision-grounded planning in real-world driving scenarios.

Abstract

The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model's capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, the first autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence, thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to top closed model systems such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.

Paper Structure

This paper contains 48 sections, 10 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: An illustration of DriveAgent-R1's active perception capability. The agent proactively uses RoI Inspection to clarify an uncertain scene, discovering a minor collision between the vehicles ahead. This active perception corrects its initial assessment, leading to a safe plan to decelerate and then stop based on direct visual evidence. More visualizations are shown in Appendix \ref{Qualitative_Results}.
  • Figure 2: The Hybrid-Thinking architecture of DriveAgent-R1. For simple scenarios (Top), the agent uses direct text-based reasoning ($T_1 \to A$). For complex scenarios (Bottom), it iteratively interleaves thoughts ($T_k$) with tool calls to a Vision Toolkit, acquiring new visual evidence ($I_k$) to refine its decision-making. A detailed visualization of this case is shown in Fig. \ref{fig:case-tool-1}, Appendix \ref{Qualitative_Results}.
  • Figure 2: Analysis of Foundational Capabilities. Overall scores on domain-specific and 8 general VLM benchmarks. Numbers in blue denote the improvement over Qwen2.5-VL-3B. Detailed results are in Appendix \ref{app:domain_alignment_details}.
  • Figure 3: The progressive three-stage training strategy for DriveAgent-R1. The process begins with (1) DM-SFT to establish a foundational understanding of both thinking modes. This is followed by a core Cascaded RL phase, where (2) FCM-RL strengthens each mode independently, and (3) AMS-RL trains the agent to adaptively select the optimal mode.
  • Figure 4: Progressive training gains on Drive-Internal$_\text{test}$. Accuracy in $\mathcal{M}_\text{adaptive}$ mode and MSA improve with each training stage.
  • ...and 11 more figures
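
The hybrid-thinking loop described in the caption of the Hybrid-Thinking architecture figure can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `Step`, `Trace`, `hybrid_think`, `toy_policy`, the `roi_inspection` key, the step budget, and the conservative fallback action are all hypothetical names and choices introduced here. The policy stands in for the VLM, emitting a thought $T_k$ and either a final action $A$ or a tool call whose result becomes new evidence $I_k$.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One policy output: a thought T_k plus either a tool call or a final action A."""
    thought: str
    tool_call: Optional[str] = None  # name of the requested tool, if any
    action: Optional[str] = None     # final plan, if the policy is ready to act


@dataclass
class Trace:
    """Record of the reasoning chain: thoughts T_k, evidence I_k, and final action A."""
    thoughts: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    action: Optional[str] = None


Policy = Callable[[str, List[str]], Step]
Toolkit = Dict[str, Callable[[str], str]]


def hybrid_think(scene: str, policy: Policy, toolkit: Toolkit, max_steps: int = 4) -> Trace:
    """Interleave thoughts with tool calls until the policy commits to an action."""
    trace = Trace()
    for _ in range(max_steps):
        step = policy(scene, trace.evidence)
        trace.thoughts.append(step.thought)
        if step.action is not None:
            # Text-only path (T_1 -> A) ends here on the first iteration.
            trace.action = step.action
            return trace
        if step.tool_call is None or step.tool_call not in toolkit:
            break  # malformed step: stop and fall back
        # Tool-augmented path: the tool result becomes new evidence I_k.
        trace.evidence.append(toolkit[step.tool_call](scene))
    # Assumption: fall back to a conservative action when no plan was produced.
    trace.action = "decelerate"
    return trace


def toy_policy(scene: str, evidence: List[str]) -> Step:
    """Hypothetical stand-in for the VLM: asks for a tool only on unclear scenes."""
    if "unclear" not in scene:
        return Step("Road is clear.", action="keep speed")
    if not evidence:
        return Step("Something ahead is unclear; inspect it.", tool_call="roi_inspection")
    return Step(f"Evidence: {evidence[-1]}", action="decelerate then stop")


toy_toolkit: Toolkit = {"roi_inspection": lambda scene: "minor collision ahead"}

simple = hybrid_think("clear road", toy_policy, toy_toolkit)
complex_case = hybrid_think("unclear vehicles ahead", toy_policy, toy_toolkit)
print(simple.action, "|", complex_case.action)
```

In this toy setup the simple scene resolves in a single thought with no tool use, while the unclear scene triggers one tool call before the plan is produced, mirroring the top and bottom paths of the figure. How the real agent decides between the two modes is learned during training and is not modeled here.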