AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang

TL;DR

AgentThink presents a unified framework that couples chain-of-thought reasoning with dynamic, agent-style tool invocation to enhance vision-language models for autonomous driving. It introduces a structured Tool-Augmented data generation pipeline, a two-stage training regime (SFT followed by GRPO-based RL fine-tuning with a multi-component reward), and a robust inference/evaluation protocol focused on tool usage quality. Empirical results on DriveLMM-o1 show substantial gains in reasoning and answer accuracy, with strong zero-shot and few-shot generalization across driving benchmarks. The approach improves interpretability and reduces hallucinations by grounding reasoning steps in external tool outputs, signaling a promising direction for trustworthy, tool-aware driving systems.
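To make the second training stage more concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO, applied to a weighted multi-component reward. The component names (`answer`, `reasoning`, `tool_use`), the weights, and the sample scores are illustrative assumptions; the summary above only states that the reward has several components, not what they are or how they are combined.

```python
from statistics import mean, pstdev

# Hypothetical reward components and weights (assumptions, not from the paper).
WEIGHTS = {"answer": 0.5, "reasoning": 0.3, "tool_use": 0.2}


def combined_reward(components):
    """Collapse per-component scores into one scalar via a weighted sum."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and standard deviation of
    its own sampling group, as GRPO does in place of a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt, with made-up component scores.
group = [
    {"answer": 1.0, "reasoning": 0.8, "tool_use": 1.0},
    {"answer": 0.0, "reasoning": 0.5, "tool_use": 0.5},
    {"answer": 1.0, "reasoning": 0.4, "tool_use": 0.0},
    {"answer": 0.0, "reasoning": 0.2, "tool_use": 0.0},
]
rewards = [combined_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below it are penalized. When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.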

Abstract

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: (i) Structured Data Generation, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.

Paper Structure

This paper contains 39 sections, 11 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Performance of the proposed AgentThink framework on the DriveLMM-o1 benchmark.
  • Figure 2: Illustration of the motivation and key highlights of our proposed framework. (a) Base VLMs use static input-output mapping with no reasoning, leading to low accuracy and frequent hallucinations. (b) VLM + CoT introduces structured reasoning, improving interpretability, but still suffers from inconsistencies and lack of verification. (c) AgentThink (Ours) augments CoT with dynamic tool use, enhancing accuracy, reducing hallucinations, and improving reasoning consistency through external verification.
  • Figure 3: AgentThink Framework Architecture. (i) Structured and scalable data generation pipeline that constructs tool-augmented reasoning; (ii) Two-stage training pipeline that first performs SFT and then applies GRPO to improve reasoning and tool-use behavior; and (iii) Unified inference and evaluation protocol that dynamically invokes tools and assesses final answers based on reasoning completeness, consistency, and tool-use effectiveness.
  • Figure 4: The model generates structured reasoning chains, dynamically invokes external tools to resolve uncertainties (e.g., object detection, trajectory prediction, lane width), and concludes with an interpretable action recommendation.
  • Figure 5: Zero-shot qualitative comparison with Qwen2.5VL-7B on BDD-X, Navsim, DriveBench and DriveMLLM.
  • ...and 3 more figures
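
The tool-augmented reasoning flow described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the tool names, the scene dictionary, the fixed `plan`, and the 10 m braking threshold are all hypothetical, and in the actual framework the model itself decides which tools to invoke rather than following a fixed list.

```python
# Stub tools that read canned values from a scene dict (hypothetical).
def detect_objects(scene):
    return {"objects": scene.get("objects", [])}


def lane_width(scene):
    return {"lane_width_m": scene.get("lane_width_m")}


# Hypothetical tool registry; the real tool library is not listed here.
TOOLS = {"detect_objects": detect_objects, "lane_width": lane_width}


def reason_with_tools(scene, plan):
    """Run each planned tool call, record it as a reasoning step, and
    end with an action grounded in the collected tool outputs."""
    trace = []
    evidence = {}
    for step, tool_name in enumerate(plan, start=1):
        tool = TOOLS.get(tool_name)
        if tool is None:
            trace.append(f"Step {step}: unknown tool '{tool_name}', skipped")
            continue
        result = tool(scene)
        evidence.update(result)
        trace.append(f"Step {step}: {tool_name} -> {result}")
    # Toy decision rule (assumption): brake if any object is closer than 10 m.
    blocked = any(o["distance_m"] < 10.0 for o in evidence.get("objects", []))
    action = "brake" if blocked else "keep lane"
    trace.append(f"Final action: {action}")
    return {"trace": trace, "action": action}


scene = {
    "objects": [{"label": "pedestrian", "distance_m": 6.5}],
    "lane_width_m": 3.5,
}
result = reason_with_tools(scene, ["detect_objects", "lane_width"])
```

Because every step in `trace` records which tool was called and what it returned, the final action can be checked against the evidence, which is the sense in which tool outputs ground the reasoning chain.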