PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving

Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, Junnyong Loo

TL;DR

PKRD-CoT addresses the high cost and complexity of end-to-end autonomous driving models by using zero-shot chain-of-thought prompting to activate four driving capabilities: perception, knowledge, reasoning, and decision-making. The framework uses a memory module that stores context in JSON format while the prompt guides the MLLM through perception, knowledge integration, reasoning, and action selection. The approach is evaluated on perception, mathematical reasoning, and decision-making tasks across GPT-4, Claude, LLaVA-1.6, Qwen-VL-Plus, CogVLM, and MiniGPT-4, and an ablation study shows that it outperforms plain zero-shot and role-playing prompts. The results support a cost-efficient pathway to deploying MLLMs in real-time autonomous driving and highlight model-specific strengths and limitations for further research.

Abstract

There is growing interest in leveraging the capabilities of robust Multi-Modal Large Language Models (MLLMs) directly within autonomous driving contexts. However, the high cost and complexity of designing and training end-to-end autonomous driving models put them out of reach for many enterprises and research entities. To address this, our study explores a seamless integration of MLLMs into autonomous driving systems by proposing a Zero-Shot Chain-of-Thought (Zero-Shot-CoT) prompt design named PKRD-CoT. PKRD-CoT is based on the four fundamental capabilities of autonomous driving: perception, knowledge, reasoning, and decision-making. By mimicking human thought processes step by step, it is particularly suited to understanding and responding to dynamic driving environments, thus enhancing decision-making in real-time scenarios. Our design enables MLLMs to tackle problems without prior experience, thereby increasing their utility within unstructured autonomous driving environments. In experiments, we demonstrate the exceptional performance of GPT-4.0 with PKRD-CoT across autonomous driving tasks. Additionally, our benchmark analysis reveals the promising viability of PKRD-CoT for other MLLMs, such as Claude, LLaVA-1.6, and Qwen-VL-Plus. Overall, this study contributes a novel and unified prompt-design framework for GPT-4.0 and other MLLMs in autonomous driving, while also rigorously evaluating the efficacy of these widely recognized MLLMs in the autonomous driving domain through comprehensive comparisons.
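To make the prompt structure concrete, the following is a minimal sketch of how a four-stage zero-shot chain-of-thought prompt with a JSON memory could be assembled. It is not the authors' implementation: the stage instructions, the `JsonMemory` class, the `build_prompt` function, and the record fields are all hypothetical names chosen for illustration, and the paper's actual prompt wording is not reproduced. The only borrowed element is the standard zero-shot CoT trigger phrase "Let's think step by step."

```python
import json
from dataclasses import dataclass, field

# Hypothetical stage instructions (assumption): one per PKRD capability,
# in the order perception -> knowledge -> reasoning -> decision-making.
STAGES = {
    "perception": "Describe the objects, lanes, and signals visible in the scene.",
    "knowledge": "State the traffic rules and driving knowledge that apply.",
    "reasoning": "Reason about distances, speeds, and possible risks.",
    "decision": "Choose one driving action and justify it briefly.",
}


@dataclass
class JsonMemory:
    """Hypothetical memory module: keeps recent records and serializes them as JSON."""

    records: list = field(default_factory=list)
    max_records: int = 5

    def add(self, record: dict) -> None:
        self.records.append(record)
        # Keep only the most recent records so the prompt stays short.
        self.records = self.records[-self.max_records:]

    def dump(self) -> str:
        return json.dumps(self.records, indent=2)


def build_prompt(scene_description: str, memory: JsonMemory) -> str:
    """Assemble a single zero-shot CoT prompt covering the four stages."""
    lines = [
        "Let's think step by step.",
        "Memory (JSON):",
        memory.dump(),
        f"Current scene: {scene_description}",
    ]
    for i, (name, instruction) in enumerate(STAGES.items(), start=1):
        lines.append(f"Step {i} ({name}): {instruction}")
    return "\n".join(lines)


if __name__ == "__main__":
    memory = JsonMemory()
    memory.add({"frame": 0, "action": "keep_lane"})
    print(build_prompt("A pedestrian is crossing ahead.", memory))
```

In this sketch the resulting string would be sent to an MLLM together with the camera image, and the model's chosen action would be appended to the memory before the next frame. How the paper actually formats its memory entries and stage prompts may differ.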

Paper Structure

This paper contains 14 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The MLLM operates as a driver agent within the PKRD-CoT paradigm, which comprises a dynamic environment, an MLLM with capabilities in perception, knowledge, reasoning, and decision-making, and a memory module that stores information in JSON format. The MLLM continuously senses the environment, recognizes targets, reasons about situations, and interacts with the memory module to make decisions for controlling the car.
  • Figure 2: Example Outputs of GPT-4.0 with PKRD-CoT in Autonomous Driving
  • Figure 3: Example of Ablation Experiment Outputs
  • Figure 4: CogVLM chat: Positioning of Car and Pedestrian
  • Figure 5: Visual Results of Knowledge Ability Comparative Experiments
  • ...and 1 more figure