Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs
Yiqun Duan, Qiang Zhang, Renjing Xu
TL;DR
This work integrates large language models (LLMs) into end-to-end (E2E) autonomous driving by moving beyond pure-language planning to a hybrid framework that fuses visual and LiDAR data into learnable multi-modality tokens. It introduces a multi-modality joint token encoder, a structured prompting scheme that captures perception descriptions and driving actions, and a re-query mechanism with reinforcement-guided tuning that helps the model correct mistakes in complex scenarios. Offline evaluations on CARLA LongSet6 report a driving score $DS$ of around $52.34$, a route completion $RC$ of around $92.37$, and an infraction score $IS$ of around $0.60$, which places the approach on par with or close to state-of-the-art baselines. The work demonstrates the potential of LLMs to contribute driving reasoning and safety checks in E2E driving, while highlighting limitations in real-time applicability and the need for further integration with online benchmarks.
Abstract
The use of Large Language Models (LLMs) in reinforcement learning, particularly as planners, has attracted significant attention in recent literature. However, most existing research focuses on planning models for robotics that convert the outputs of perception models into linguistic form, thus adopting a "pure-language" strategy. In this research, we propose a hybrid End-to-End learning framework for autonomous driving that combines basic driving imitation learning with LLMs based on multi-modality prompt tokens. Instead of simply converting perception results from separately trained models into pure language input, our novelty lies in two aspects. 1) The end-to-end integration of visual and LiDAR sensory input into learnable multi-modality tokens, thereby intrinsically alleviating the description bias introduced by separately pre-trained perception models. 2) Instead of directly letting LLMs drive, this paper explores a hybrid setting in which LLMs help the driving model correct mistakes and handle complicated scenarios. Our experiments suggest that the proposed method can attain a driving score of 49.21%, coupled with a route completion rate of 91.34%, in the offline evaluation conducted via CARLA. These performance metrics are comparable to those of the most advanced driving models.
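For readers unfamiliar with the reported metrics, the following minimal sketch shows how a driving score is commonly aggregated from route completion and infraction score. It assumes the CARLA Leaderboard convention, in which the driving score is the mean over routes of the per-route product of route completion (in percent) and infraction score (in $[0, 1]$); whether the evaluation here follows exactly this convention is not stated in this section. The function name and the per-route values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def driving_score(route_completion: Sequence[float],
                  infraction_score: Sequence[float]) -> float:
    """Mean over routes of RC_i * IS_i.

    Assumed convention (CARLA Leaderboard style): RC_i is a percentage
    in [0, 100] and IS_i is a multiplicative penalty in [0, 1].
    """
    if len(route_completion) != len(infraction_score):
        raise ValueError("need one infraction score per route")
    if len(route_completion) == 0:
        raise ValueError("need at least one route")
    per_route = [rc * s for rc, s in zip(route_completion, infraction_score)]
    return sum(per_route) / len(per_route)


# Hypothetical per-route values, not taken from the paper.
rc_per_route = [100.0, 80.0]
is_per_route = [0.5, 1.0]
ds = driving_score(rc_per_route, is_per_route)
```

Because the product is taken per route before averaging, the driving score generally differs from the product of the averaged route completion and the averaged infraction score. In the hypothetical example above, the per-route products are 50 and 80, so the driving score is 65, whereas the product of the averages (90 and 0.75) would be 67.5.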
