SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, Hongsheng Li

TL;DR

The paper addresses the challenge of integrating Vision-Language Models with end-to-end autonomous driving to improve planning under real-time constraints. It proposes SOLVE, a framework that combines a Sequential Q-Former for feature-level knowledge sharing, a Trajectory Chain-of-Thought (T-CoT) for coarse-to-fine trajectory reasoning with a 36-trajectory bank, and a time-decoupled, memory-based strategy to fuse high-quality VLM outputs with an efficient E2E planner. A multi-stage training regimen, including LoRA-based QA prompts and joint optimization, enables effective cross-domain supervision; evaluations on the nuScenes dataset demonstrate state-of-the-art open-loop planning performance and improved safety metrics. The results indicate that sharing a visual encoder between VLM and E2E branches, coupled with trajectory-level CoT and asynchronous initialization from VLM forecasts, yields robust, real-time-capable autonomous driving with enhanced interpretability and reasoning.

Abstract

The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
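The temporal decoupling idea mentioned in the abstract can be illustrated with a toy loop: a slow branch writes its forecast into a memory buffer, and a fast branch that runs every frame reads whatever forecast is currently cached. This is only a sketch under stated assumptions, not the authors' implementation. `TrajectoryMemory`, `slow_vlm_plan`, `fast_e2e_plan`, `run`, and the `vlm_period` parameter are all hypothetical names, and both planners are trivial stand-ins rather than real models.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]


@dataclass
class TrajectoryMemory:
    """Hypothetical buffer holding the latest slow-branch forecast."""
    trajectory: Optional[Trajectory] = None
    source_frame: int = -1  # -1 means nothing has been written yet

    def write(self, trajectory: Trajectory, frame: int) -> None:
        self.trajectory = trajectory
        self.source_frame = frame


def slow_vlm_plan(frame: int) -> Trajectory:
    # Stand-in for the VLM branch: a straight-line forecast of 3 waypoints.
    return [(0.0, float(frame + k)) for k in range(1, 4)]


def fast_e2e_plan(init: Optional[Trajectory]) -> Trajectory:
    # Stand-in for the E2E branch: start from the cached forecast when one
    # exists, otherwise fall back to a stationary default.
    if init is None:
        return [(0.0, 0.0)] * 3
    return list(init)


def run(num_frames: int, vlm_period: int):
    """Simulate both branches; return (frame, forecast source frame, plan)."""
    memory = TrajectoryMemory()
    log = []
    for frame in range(num_frames):
        # The fast branch runs every frame and reads whatever is cached.
        plan = fast_e2e_plan(memory.trajectory)
        log.append((frame, memory.source_frame, plan))
        # The slow branch runs every vlm_period frames (assumed schedule);
        # its output is visible to the fast branch from the next frame on.
        if frame % vlm_period == 0:
            memory.write(slow_vlm_plan(frame), frame)
    return log


if __name__ == "__main__":
    for frame, source, plan in run(num_frames=6, vlm_period=3):
        print(frame, source, plan)
```

In this toy setup the fast branch never waits for the slow one: it reuses a forecast that may be a few frames old, which is the trade-off a time-decoupled design accepts in exchange for per-frame latency.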

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Previous methods combine VLM and end-to-end networks through post-processing, while our method combines VLM and end-to-end networks through both feature-level synergy (shared visual encoder) and trajectory-level synergy.
  • Figure 2: The overall framework of the proposed SOLVE.
  • Figure 3: Details of the proposed SQ-Former. We first capture the static cues from multi-view images and then sequentially align the model with different perception tasks.
  • Figure 4: Illustration of how the proposed trajectory tokens are combined with image tokens and text tokens for large language model-based planning.
  • Figure 5: Qualitative results of SOLVE, where red, blue, and yellow lines show the planning results from the VLM, E2E-Async, and E2E modules, respectively. The ego car (green box) and ground truth trajectory (green line) are shown in the right BEV images.