DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking
Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao
TL;DR
DriveAgent-R1 introduces an active-perception autonomous driving agent that grounds high-level planning in on-demand visual evidence via a Vision Toolkit. It employs a hybrid-thinking framework that adaptively switches between efficient text-based reasoning and robust tool-augmented visual reasoning, and it is trained through a three-stage progressive curriculum (SFT, FCM-RL, AMS-RL) guided by a domain-aligned visual prior. Empirical results on Drive-Internal and nuScenes show the 3B-parameter DriveAgent-R1 achieving performance competitive with GPT-5 and human drivers, with clear gains from adaptive mode selection and tool use, while improving interpretability and safety grounding. The work also provides extensive ablations and analyses on domain alignment, progressive training, and active versus passive perception, demonstrating both the benefits and limitations of vision-grounded planning in real-world driving scenarios.
Abstract
The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model's capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, the first autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence and thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, that allows the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves performance comparable to leading closed-source models such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.
