Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs

Zeyu Dong, Yimin Zhu, Yansong Li, Kevin Mahon, Yu Sun

TL;DR

This paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end driving models operating in closed-loop settings in real-world environments, and shows that the architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.

Abstract

Traditional autonomous driving methods adopt a modular design, decomposing tasks into sub-tasks. In contrast, end-to-end autonomous driving directly outputs actions from raw sensor data, avoiding error accumulation. However, training an end-to-end model requires a comprehensive dataset; otherwise, the model exhibits poor generalization capabilities. Recently, large language models (LLMs) have been applied to enhance the generalization capabilities of end-to-end driving models. Most studies explore LLMs in an open-loop manner, where the output actions are compared to those of experts without direct feedback from the real world, while others examine closed-loop results only in simulations. This paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end driving models operating in closed-loop settings in real-world environments. In our architecture, the LLM periodically processes raw sensor data to generate high-level driving instructions, effectively guiding the end-to-end model, even at a slower rate than the raw sensor data. This architecture relaxes the trade-off between the latency and inference quality of the LLM. It also allows us to choose from a wide variety of LLMs to improve high-level driving instructions and minimize fine-tuning costs. Consequently, our architecture reduces data collection requirements because the LLMs do not directly output actions; we only need to train a simple imitation learning model to output actions. In our experiments, the training data for the end-to-end model in a real-world environment consists of only simple obstacle configurations with one traffic cone, while the test environment is more complex and contains multiple obstacles placed in various positions. Experiments show that the proposed architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 15 tables.

Figures (5)

  • Figure 1: General idea of the architecture. Top left: the training dataset for the end-to-end model. The purple line represents the trajectories of the car for this route. Top right: an answer generated by ChatGPT-4o using the image in the bottom right. Bottom left: closed-loop evaluation on unseen scenarios for the end-to-end model without LLMs. Bottom right: evaluation on unseen scenarios with high-level instructions from LLMs. The LLM still recognizes the new blue trash bin obstacle, which is not included in the training dataset, evaluates the viable empty space, and chooses justifiable instructions.
  • Figure 2: The proposed architecture inputs sensor data to both the LLM and the end-to-end model. The end-to-end model outputs actions from sensor images and receives high-level instructions from the LLM at a lower rate, because LLM inference is slower. This setup bridges the gap between the fast-moving vehicle and the slow but context-aware decisions of the LLM.
  • Figure 3: A closed-loop pipeline of the proposed architecture using ChatGPT-4o: The LLM takes the front-view image of the ego car with CoT prompts and generates the instruction. The end-to-end model then takes the previous LLM-assisted instruction, along with the real-time sensor input, and outputs steering and throttle for real-time control. The inference times of the end-to-end model and the LLM are denoted as $d$ and $l$, respectively. The steering ranges from $-1$ (leftmost) to $1$ (rightmost) and the throttle ranges from $0$ to $1$.
  • Figure 4: Training and testing environments.
  • Figure 5: Example images.
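
The two-rate pipeline described in the captions of Figures 2 and 3 can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_closed_loop`, `Action`, `stub_llm`, `stub_policy`, the instruction strings, and the rule that a new LLM request starts as soon as the previous one finishes are all hypothetical. The stubs stand in for the multimodal LLM and the imitation learning model, and time is counted in integer milliseconds, with `d_ms` and `l_ms` playing the roles of $d$ and $l$.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Action:
    steering: float  # -1 (leftmost) to 1 (rightmost), as in the Figure 3 caption
    throttle: float  # 0 to 1, as in the Figure 3 caption


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def run_closed_loop(
    frames: Sequence[str],
    llm: Callable[[str], str],
    policy: Callable[[str, str], Tuple[float, float]],
    d_ms: int,
    l_ms: int,
    default_instruction: str = "go straight",  # assumed fallback before the first LLM answer
) -> List[Tuple[str, Action]]:
    """Simulate a fast control loop guided by a slower instruction source."""
    instruction = default_instruction
    pending: Optional[Tuple[int, str]] = None  # (time the answer is ready, answer)
    log: List[Tuple[str, Action]] = []
    for k, frame in enumerate(frames):
        t = k * d_ms  # the policy runs once every d_ms
        # Adopt the LLM answer once its simulated latency has elapsed.
        if pending is not None and t >= pending[0]:
            instruction = pending[1]
            pending = None
        # Assumption: a new request is issued on the current frame when idle.
        if pending is None:
            pending = (t + l_ms, llm(frame))
        # The policy always acts on the latest finished instruction.
        steering, throttle = policy(frame, instruction)
        action = Action(clamp(steering, -1.0, 1.0), clamp(throttle, 0.0, 1.0))
        log.append((instruction, action))
    return log


# Hypothetical stand-ins for the LLM and the end-to-end model.
def stub_llm(frame: str) -> str:
    return "steer left" if frame == "cone" else "go straight"


def stub_policy(frame: str, instruction: str) -> Tuple[float, float]:
    # Deliberately out-of-range outputs to show the clamping step.
    return (-2.0, 0.5) if instruction == "steer left" else (0.0, 1.5)


frames = ["clear", "clear", "cone", "cone", "cone", "cone"]
log = run_closed_loop(frames, stub_llm, stub_policy, d_ms=50, l_ms=100)
for instruction, action in log:
    print(instruction, action)
```

In this toy run the cone first appears at the third frame, but the policy only starts receiving "steer left" two control steps later, once the simulated LLM latency has elapsed. That lag is the gap the architecture is designed to tolerate: the end-to-end model keeps reacting to real-time sensor input while the high-level instruction is still catching up.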