CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving

Zhaohui Wang, Tengbo Yu, Hao Tang

TL;DR

CoT4AD introduces explicit chain-of-thought reasoning into a vision-language-action model for autonomous driving, integrating 3D perception, VLM-based prompting, and diffusion-based planning to achieve principled, multi-step reasoning and robust end-to-end decisions. The framework trains perception, VQA, future prediction, and planning in a unified, CoT-aligned manner while enabling fast inference with implicit CoT. Extensive experiments on nuScenes and Bench2Drive show state-of-the-art open-loop and closed-loop performance, with ablations highlighting the importance of future-state prediction and multi-modal perception tokens. This work advances interpretable, robust end-to-end driving by embedding structured reasoning into multi-modal perception and trajectory planning.

Abstract

Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison with existing frameworks. (a) Classic E2E methods map sensor inputs to control outputs with modular designs. (b) Classic VLA methods incorporate language reasoning and VLMs to map sensor inputs to the action space for planning. (c) CoT4AD (Ours) incorporates CoT reasoning into VLMs to enable explicit multi-step reasoning for planning.
  • Figure 2: (a) Architecture of CoT4AD. It consists of four stages of CoT reasoning: 3D perception, VQA, VLM-conditioned diffusion, and planning. (b) VLM-Conditioned Latent Diffusion. A conditional latent DiT model diffuses the latent of the current frame conditioned on VLM embeddings, and the future frame is reconstructed via a VAE decoder.
  • Figure 3: Qualitative results of CoT4AD on the Bench2Drive closed-loop evaluation set.
  • Figure 4: Ablation on the number of predicted future scenes.