SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Katrin Renz, Long Chen, Elahe Arani, Oleg Sinavski

TL;DR

SimLingo presents a vision-language-action framework that unifies closed-loop driving with vision-language understanding and explicit language-action alignment, using a camera-only pipeline. The method combines high-resolution tile-based image encoding, a finetuned large language model with LoRA, and disentangled action outputs to drive and describe decisions, while introducing Action Dreaming to align language instructions with executable trajectories. It achieves state-of-the-art results on CARLA Leaderboard 2.0 and Bench2Drive, and demonstrates strong performance on VQA/Commentary tasks alongside robust language-conditioned driving. The work highlights the importance of aligning language with action for robust generalization and interactive driving, while acknowledging limitations related to real-world latency and the need for further exploration of Chain-of-Thought gains.

Abstract

Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. For autonomous driving, however, this is only useful if it is aligned with the action space; otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and uses only camera input, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance in the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry at the CARLA Challenge 2024. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance.

Paper Structure

This paper contains 27 sections, 5 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Overview: SimLingo is a vision-language-action model unifying the tasks of autonomous driving, vision-language understanding and language-action alignment. It is state of the art on the official CARLA Leaderboard 2.0 and Bench2Drive using only camera images. We introduce the task of Action Dreaming, a form of instruction following, to improve the alignment of language and action.
  • Figure 2: SimLingo architecture. We encode the image, navigational conditioning and the language prompt. To encode high-resolution images, we split them into tiles and encode each tile independently, which lets us reuse the image encoder pre-trained at 448x448 resolution. All embeddings are processed by an LLM, which we finetune with LoRA to predict language and actions. The action output uses a disentangled representation with both temporal speed waypoints and geometric path waypoints for improved lateral control.
  • Figure 3: Qualitative results for VQA and Commentary. For VQA we show the question, the ground truth answer and the predicted answer for two example scenes. Both questions refer to distant objects that occupy only a few pixels, but the model still produces correct answers.
  • Figure 4: Qualitative results of Action Dreaming. We show the predicted actions for a diverse set of situations and instructions. The model can successfully adapt to path- and speed-related instructions. Legend: red: path waypoints, green: speed waypoints, blue graph: speed in m/s.
  • Figure 5: SimLingo-BASE architecture. The images are split in two, and each split is independently encoded. The encodings are then concatenated, downsampled, and projected before being fed into a transformer decoder based on the LLaMA architecture. The output uses the same representation as SimLingo.
  • ...and 6 more figures
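
The Figure 2 caption describes two mechanisms that a short sketch can make concrete: cutting a high-resolution image into 448x448 tiles for the image encoder, and reading a speed from temporally spaced waypoints. The NumPy code below is an illustration only, not the authors' implementation. The function names, the zero-padding of partial tiles, and the waypoint time step dt are all assumptions that do not come from the paper.

```python
import numpy as np

TILE = 448  # encoder input resolution stated in the Figure 2 caption


def split_into_tiles(image: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Zero-pad an (H, W, C) image to a multiple of `tile` and cut it into tiles.

    Zero-padding is an assumption made for this sketch; a real pipeline
    might resize the image instead.
    Returns an array of shape (rows * cols, tile, tile, C) in row-major order.
    """
    h, w, c = image.shape
    rows = -(-h // tile)  # ceiling division
    cols = -(-w // tile)
    padded = np.zeros((rows * tile, cols * tile, c), dtype=image.dtype)
    padded[:h, :w] = image
    # (rows, tile, cols, tile, C) -> (rows, cols, tile, tile, C)
    tiles = padded.reshape(rows, tile, cols, tile, c).swapaxes(1, 2)
    return tiles.reshape(rows * cols, tile, tile, c)


def speeds_from_waypoints(waypoints: np.ndarray, dt: float) -> np.ndarray:
    """Mean speed in m/s on each segment between consecutive (x, y) waypoints.

    Assumes the waypoints are spaced `dt` seconds apart (hypothetical value).
    """
    steps = np.diff(waypoints, axis=0)
    return np.linalg.norm(steps, axis=1) / dt


if __name__ == "__main__":
    image = np.random.rand(500, 900, 3)
    print(split_into_tiles(image).shape)
    print(speeds_from_waypoints(np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]), dt=0.5))
```

Each tile can then be passed through the same pre-trained encoder independently, which is the reuse the caption refers to. Path waypoints, by contrast, are spaced geometrically and carry no timing, so a speed cannot be read from them this way.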