Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

Yiqun Duan, Qiang Zhang, Renjing Xu

TL;DR

This work tackles integrating large language models (LLMs) into end-to-end (E2E) autonomous driving by moving beyond pure-language planning to a hybrid framework that fuses visual and LiDAR data into learnable multi-modality tokens. It introduces a multi-modality joint token encoder, a structured prompting scheme that captures perception descriptions and driving actions, and a re-query mechanism with reinforcement-guided tuning that helps the model correct mistakes in complex scenarios. Evaluations on CARLA LongSet6 show competitive offline driving metrics, with driving score $DS$ around $52.34$, route completion $RC$ around $92.37$, and infraction score $IS$ around $0.60$, indicating the approach is on par with or close to state-of-the-art baselines. The work demonstrates the potential of LLMs to contribute driving reasoning and safety checks in E2E driving, while highlighting limitations in real-time applicability and the need for further integration with online benchmarks.
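
For readers unfamiliar with these metrics, here is a minimal sketch of how they are conventionally related on CARLA-style benchmarks. It assumes the standard leaderboard definitions: each route's driving score is its route completion multiplied by its infraction score, and the reported figures are averages over routes. The function names and the two example routes are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def route_driving_score(route_completion_pct: float, infraction_score: float) -> float:
    """Per-route driving score: completion percentage scaled by the infraction score.

    Assumes the usual convention that infraction_score starts at 1.0 and is
    reduced multiplicatively for each infraction committed on the route.
    """
    return route_completion_pct * infraction_score


def aggregate_metrics(routes: Sequence[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Average (DS, RC, IS) over routes given (route_completion_pct, infraction_score) pairs."""
    n = len(routes)
    mean_rc = sum(rc for rc, _ in routes) / n
    mean_is = sum(inf for _, inf in routes) / n
    mean_ds = sum(route_driving_score(rc, inf) for rc, inf in routes) / n
    return mean_ds, mean_rc, mean_is


# Hypothetical routes for illustration only (not results from the paper).
example_routes = [(100.0, 0.5), (80.0, 1.0)]
ds, rc, inf = aggregate_metrics(example_routes)
```

Because $DS$ is averaged per route, it generally differs from the product of the averaged $RC$ and $IS$. In the hypothetical example the mean driving score is 65.0, while the product of the means is 90.0 × 0.75 = 67.5.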

Abstract

The utilization of Large Language Models (LLMs) within the realm of reinforcement learning, particularly as planners, has garnered significant attention in recent literature. However, a substantial proportion of existing research focuses on planning models for robotics that convert the outputs of perception models into linguistic form, thus adopting a "pure-language" strategy. In this research, we propose a hybrid end-to-end learning framework for autonomous driving that combines basic driving imitation learning with LLMs based on multi-modality prompt tokens. Instead of simply converting perception results from a separately trained model into pure language input, our novelty lies in two aspects. 1) The end-to-end integration of visual and LiDAR sensory input into learnable multi-modality tokens, thereby intrinsically alleviating the description bias introduced by separately pre-trained perception models. 2) Instead of directly letting LLMs drive, this paper explores a hybrid setting in which LLMs help the driving model correct mistakes and handle complicated scenarios. The results of our experiments suggest that the proposed methodology can attain a driving score of 49.21%, coupled with a route completion rate of 91.34%, in the offline evaluation conducted via CARLA. These performance metrics are comparable to those of the most advanced driving models.

Paper Structure (14 sections, 3 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of utilizing a language model for autonomous driving. The system takes camera and LiDAR input and extracts shallow feature maps with two branches. A joint Swin transformer then encodes the perception input into a joint token representation. The prompt is constructed by sequentially concatenating multi-modality joint tokens, status repeat tokens, and driving task tokens. The task prompts are of two kinds: 1) directly predicting the driving actions and 2) correcting the driving actions given the driving output. The controller model also predicts an uncertainty score with an MLP layer, which decides whether to ask GPT for a correction or to drive by itself. The model is trained by auto-regressively predicting the perception description and the driving action. Driving actions are executed by a final controller.
  • Figure 2: Illustration of the joint token representation. The perception token is aligned with a segmentation embedding and a position embedding to distinguish it from normal word tokens.
  • Figure 3: Illustration of the prompt construction. Both traditional perception supervision and driving actions are converted into language descriptions through a descriptor.
  • Figure 4: Illustration of the re-query mechanism when there is disagreement.
  • Figure 5: Visualization comparing driving states of traditional methods and the language model.
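
The control flow described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation. All names (`DrivingAction`, `build_prompt`, `select_action`) and the threshold value are hypothetical. Tokens are shown as plain integers rather than embeddings. The sketch also assumes that a higher uncertainty score triggers the correction request, which the caption does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class DrivingAction:
    """Hypothetical container for a low-level driving command."""
    steer: float
    throttle: float
    brake: float


def build_prompt(
    joint_tokens: Sequence[int],
    status_tokens: Sequence[int],
    task_tokens: Sequence[int],
) -> List[int]:
    """Concatenate tokens in the order given in the caption:
    multi-modality joint tokens, then status repeat tokens, then driving task tokens."""
    return list(joint_tokens) + list(status_tokens) + list(task_tokens)


def select_action(
    proposed: DrivingAction,
    uncertainty: float,
    threshold: float,
    request_correction: Callable[[DrivingAction], DrivingAction],
) -> DrivingAction:
    """Route the proposed action through a correction step when uncertainty is high.

    Assumption: the gate is a simple threshold on the predicted uncertainty score.
    """
    if uncertainty > threshold:
        return request_correction(proposed)
    return proposed


def brake_fully(action: DrivingAction) -> DrivingAction:
    """Stand-in for the language-model correction: stop the vehicle, keep the steering."""
    return DrivingAction(steer=action.steer, throttle=0.0, brake=1.0)


prompt = build_prompt([101, 102], [201], [301, 302])
proposed = DrivingAction(steer=0.1, throttle=0.6, brake=0.0)
confident = select_action(proposed, uncertainty=0.2, threshold=0.5, request_correction=brake_fully)
corrected = select_action(proposed, uncertainty=0.9, threshold=0.5, request_correction=brake_fully)
```

Passing the correction step in as a callable keeps the gating logic independent of whichever language model performs the correction. In the example, `brake_fully` is only a placeholder for that call.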