CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting

Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, Zhenning Li

TL;DR

CoT-Drive tackles robust, real-time motion forecasting for autonomous driving by transferring the reasoning strengths of large language models to lightweight edge models through chain-of-thought prompting and a teacher-student distillation pipeline. It introduces Highway-Text and Urban-Text to train compact LMs to generate semantic scene annotations, and employs a four-module encoder-decoder with multimodal fusion and uncertainty modeling to predict multimodal trajectories. Across five real-world datasets, CoT-Drive outperforms state-of-the-art baselines while maintaining practical edge-device latency, demonstrating the practicality of LLM-inspired scene understanding in AD. The work offers a scalable path toward explainable, generalizable motion forecasting on resource-constrained platforms, combining prompt-engineered linguistic reasoning with efficient edge inference.

Abstract

Accurate motion forecasting is crucial for safe autonomous driving (AD). This study proposes CoT-Drive, a novel approach that enhances motion forecasting by leveraging large language models (LLMs) and a chain-of-thought (CoT) prompting method. We introduce a teacher-student knowledge distillation strategy to effectively transfer LLMs' advanced scene understanding capabilities to lightweight language models (LMs), ensuring that CoT-Drive operates in real time on edge devices while maintaining comprehensive scene understanding and generalization capabilities. By leveraging CoT prompting techniques for LLMs without additional training, CoT-Drive generates semantic annotations that significantly improve the understanding of complex traffic environments, thereby boosting the accuracy and robustness of predictions. Additionally, we present two new scene description datasets, Highway-Text and Urban-Text, designed for fine-tuning lightweight LMs to generate context-specific semantic annotations. Comprehensive evaluations on five real-world datasets demonstrate that CoT-Drive outperforms existing models, highlighting its effectiveness and efficiency in handling complex traffic scenarios. Overall, this study is the first to consider the practical application of LLMs in this field. It pioneers the training and use of a lightweight LLM surrogate for motion forecasting, setting a new benchmark and showcasing the potential of integrating LLMs into AD systems.
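
To make the teacher-student labeling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the teacher LLM is queried once per CoT stage, with earlier answers fed back as context, and that the concatenated answers become the fine-tuning target for the lightweight LM. The four stage names are taken from the Figure 2 caption; the instruction wording, the `DistillationRecord` schema, and the `fake_teacher` stand-in are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Stage names come from the Figure 2 caption; the instruction text is hypothetical.
COT_STAGES: List[Tuple[str, str]] = [
    ("Background and Statistics", "Describe the road layout and agent counts."),
    ("Interaction Analysis", "Explain how the agents influence one another."),
    ("Risk Assessment", "Identify the most safety-critical interactions."),
    ("Prediction", "State the likely maneuver of the target vehicle."),
]


@dataclass
class DistillationRecord:
    """One fine-tuning example for the student LM (hypothetical schema)."""
    prompt: str  # what the student sees at inference time
    target: str  # the teacher's staged annotation the student learns to emit


def label_scene(scene: str, teacher: Callable[[str], str]) -> DistillationRecord:
    """Query the teacher once per stage, feeding earlier answers back in."""
    answers: List[str] = []
    for stage, instruction in COT_STAGES:
        context = "\n".join(answers)
        prompt = f"Scene: {scene}\n{context}\n[{stage}] {instruction}"
        answers.append(f"[{stage}] {teacher(prompt)}")
    return DistillationRecord(prompt=f"Scene: {scene}", target="\n".join(answers))


def fake_teacher(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"(answer based on {len(prompt)} characters of context)"


if __name__ == "__main__":
    record = label_scene("3 vehicles on a two-lane highway", fake_teacher)
    print(record.target)
```

Passing the teacher as a plain callable keeps the sketch independent of any particular LLM client; a real pipeline would substitute an API call and store the resulting records as a text dataset for fine-tuning.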

Paper Structure

This paper contains 51 sections, 17 equations, 10 figures, and 10 tables.

Figures (10)

  • Figure 1: Illustration of the strength of CoT-Drive (d), comparing edge LMs (a), local LLMs with edge LMs (b), and online LLMs with edge LMs (c) on key perspectives: response time, security, adaptability, and scene understanding capabilities.
  • Figure 2: Illustration of the chain-of-thought prompting used in our proposed datasets to generate semantic annotations for a given traffic scene. The dialogue progression is methodically structured under human-like cognitive processes that include Background and Statistics, Interaction Analysis, Risk Assessment, and Prediction. Within each thematic category (step), we systematically infuse the LLM with specific knowledge and illustrative examples.
  • Figure 3: Overall pipeline of CoT-Drive. Panel (a) illustrates the encoder-decoder architecture of CoT-Drive, comprising four main modules: Language-Instructed Encoder, Interaction-aware Encoder, Cross-modal Encoder, and Decoder. Panels (b-1) and (b-2) illustrate the workflows of the Language-Instructed Encoder and the training process for the edge LM. This training involves multimodal fusion of semantic annotations and spatio-temporal data, where annotations are generated by a fine-tuned LM. The edge LM is trained on real-world text data labeled through CoT prompting-enhanced GPT-4 Turbo, allowing it to inherit the rich contextual learning capabilities of LLMs. Panel (c) illustrates the Decoder, which utilizes a deep ensemble method to handle aleatoric and epistemic uncertainties, combining Gaussian Mixture Models for maneuver-based predictions.
  • Figure 4: Validation loss curves for four LMs on the developed datasets: (a1)-(a4) show the validation loss curves for Phi-1.5, TinyLlama, Qwen-1.5, and GPT-Neo on the Urban-Text dataset; (b1)-(b4) show the corresponding curves on the Highway-Text dataset.
  • Figure 5: Comparison of four different LMs in Parameter Count (a) and Performance on Urban-Text (b) and Highway-Text (c). Note: F1 Score is the evaluation metric.
  • ...and 5 more figures
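
The Figure 3 caption mentions a deep ensemble that handles aleatoric and epistemic uncertainties. As a hedged illustration of that general idea only, the sketch below applies the standard moment-matching rule for an equally weighted ensemble of Gaussian predictions: the average predicted variance is read as aleatoric uncertainty, and the variance of the member means as epistemic uncertainty. This is an assumption about how such a split is commonly computed, not the paper's decoder. It handles a single scalar coordinate and does not model the maneuver-based Gaussian Mixture Models; the function name and input layout are hypothetical.

```python
from statistics import fmean, pvariance
from typing import List, Tuple


def decompose_uncertainty(
    members: List[Tuple[float, float]],
) -> Tuple[float, float, float, float]:
    """Split ensemble uncertainty for one scalar coordinate.

    members: one (mean, variance) pair per ensemble member.
    Returns (ensemble_mean, aleatoric, epistemic, total_variance).
    """
    means = [mean for mean, _ in members]
    variances = [var for _, var in members]
    ensemble_mean = fmean(means)
    aleatoric = fmean(variances)   # average noise the members predict
    epistemic = pvariance(means)   # disagreement between member means
    return ensemble_mean, aleatoric, epistemic, aleatoric + epistemic


if __name__ == "__main__":
    # Three hypothetical members predicting a future x-position in meters.
    preds = [(10.0, 0.4), (10.5, 0.6), (9.5, 0.5)]
    mean, alea, epis, total = decompose_uncertainty(preds)
    print(f"mean={mean:.2f} aleatoric={alea:.3f} epistemic={epis:.3f} total={total:.3f}")
```

The total equals the variance of the uniform mixture of the member Gaussians, so the two terms can be reported separately without changing the overall predictive spread.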