PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving

Xuewen Luo; Fan Ding; Yinsheng Song; Xiaofeng Zhang; Junnyong Loo

TL;DR

PKRD-CoT addresses the high cost and complexity of end-to-end autonomous driving models by using zero-shot chain-of-thought prompting to activate four driving capabilities: perception, knowledge, reasoning, and decision-making. The framework employs a memory-JSON module to maintain context while guiding MLLMs through perception, knowledge integration, and action selection. Across GPT-4, Claude, LLava1.6, Qwen-VL-Plus, CogVLM, and MiniGPT-4, the approach is evaluated on perception, mathematical reasoning, and decision-making tasks, with ablation showing superiority over zero-shot and role-playing prompts. The results support a cost-efficient pathway to deploying MLLMs in real-time autonomous driving and highlight model-specific strengths and limitations for further research.

Abstract

There is growing interest in using robust Multi-Modal Large Language Models (MLLMs) directly in autonomous driving. However, the high cost and complexity of designing and training end-to-end autonomous driving models put them out of reach for many enterprises and research entities. To address this, our study explores a seamless integration of MLLMs into autonomous driving systems by proposing a Zero-Shot Chain-of-Thought (Zero-Shot-CoT) prompt design named PKRD-CoT. PKRD-CoT is based on the four fundamental capabilities of autonomous driving: perception, knowledge, reasoning, and decision-making. By mimicking human thought processes step by step, it is particularly suited to understanding and responding to dynamic driving environments, thus enhancing decision-making in real-time scenarios. Our design enables MLLMs to tackle problems without prior experience, thereby increasing their utility in unstructured autonomous driving environments. In experiments, we demonstrate the exceptional performance of GPT-4.0 with PKRD-CoT across autonomous driving tasks. Additionally, our benchmark analysis reveals the promising viability of PKRD-CoT for other MLLMs, such as Claude, LLava1.6, and Qwen-VL-Plus. Overall, this study contributes a novel and unified prompt-design framework for GPT-4.0 and other MLLMs in autonomous driving, and rigorously evaluates the efficacy of these widely recognized MLLMs in this domain through comprehensive comparisons.
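As a rough illustration of the idea summarized above, the following Python sketch assembles a zero-shot chain-of-thought prompt that walks a model through perception, knowledge, reasoning, and decision-making, and carries context between frames in a small JSON memory. Everything in it is an assumption made for illustration: the stage instructions, the `build_prompt` and `update_memory` helpers, and the memory layout are hypothetical and are not the authors' actual prompt or module. The phrase "Let's think step by step" is the common zero-shot CoT trigger, used here only as a plausible stand-in.

```python
import json

# Hypothetical stage instructions. The paper's actual prompt wording is not
# reproduced here; these strings are placeholders for illustration only.
STAGES = [
    ("perception", "Describe the objects, lanes, signals, and road users in the scene."),
    ("knowledge", "State the traffic rules and driving knowledge relevant to the scene."),
    ("reasoning", "Reason step by step about distances, speeds, and risks."),
    ("decision", "Choose one driving action and justify it briefly."),
]


def build_prompt(scene_description, memory):
    """Assemble a zero-shot CoT prompt from the four stages plus a JSON memory."""
    lines = ["You are assisting with an autonomous driving task."]
    # The memory is serialized as JSON so earlier context travels with the prompt.
    lines.append("Memory from previous frames (JSON): " + json.dumps(memory, sort_keys=True))
    lines.append("Current scene: " + scene_description)
    lines.append("Let's think step by step:")
    for index, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{index}. {name.capitalize()}: {instruction}")
    return "\n".join(lines)


def update_memory(memory, frame_id, decision):
    """Return a new memory dict with the latest decision appended (input is not mutated)."""
    history = list(memory.get("history", []))
    history.append({"frame": frame_id, "decision": decision})
    return {"history": history}


# Example: one earlier decision is stored, then a prompt is built for a new scene.
memory = update_memory({}, frame_id=0, decision="keep lane")
prompt = build_prompt("A pedestrian is crossing 20 m ahead.", memory)
print(prompt)
```

The resulting string would be sent to an MLLM together with the camera image; the model call itself is omitted because it depends on the provider's API. Keeping the memory as plain JSON makes it easy to inspect and to truncate when the context grows.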
