SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, Hongsheng Li
TL;DR
The paper addresses the challenge of integrating Vision-Language Models (VLMs) with end-to-end (E2E) autonomous driving to improve planning under real-time constraints. It proposes SOLVE, a framework that combines a Sequential Q-Former for feature-level knowledge sharing, a Trajectory Chain-of-Thought (T-CoT) for coarse-to-fine trajectory reasoning with a $36$-trajectory bank, and a time-decoupled, memory-based strategy that fuses high-quality VLM outputs with an efficient E2E planner. A multi-stage training regimen, including LoRA-based QA prompts and joint optimization, enables effective cross-domain supervision. Evaluations on the nuScenes dataset demonstrate state-of-the-art open-loop planning performance and improved safety metrics. The results indicate that sharing a visual encoder between the VLM and E2E branches, coupled with trajectory-level CoT and asynchronous initialization from VLM forecasts, yields robust, real-time-capable autonomous driving with enhanced interpretability and reasoning.
Abstract
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
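
As a rough illustration of the coarse-to-fine T-CoT idea and the temporal decoupling strategy summarized above, the following Python sketch picks one candidate from a 36-entry trajectory bank, refines it with per-waypoint offsets, and stores the result in a memory that a faster planner can read on later frames. This is a minimal sketch under stated assumptions, not the authors' implementation: the bank construction (6 headings by 6 speeds), the `scores` and `offsets` inputs standing in for model outputs, and all function and class names are hypothetical.

```python
import math


def build_bank(num_headings=6, num_speeds=6, horizon=6, dt=0.5):
    """Hypothetical bank: straight-line paths for each (heading, speed) pair.

    With the default arguments this yields 6 * 6 = 36 candidates.
    """
    bank = []
    for h in range(num_headings):
        heading = math.radians(-25 + 10 * h)  # assumed heading grid
        for s in range(num_speeds):
            speed = 2.0 * s  # assumed speed grid in m/s
            traj = [
                (speed * dt * (t + 1) * math.sin(heading),
                 speed * dt * (t + 1) * math.cos(heading))
                for t in range(horizon)
            ]
            bank.append(traj)
    return bank


def coarse_select(scores):
    """Coarse step: index of the highest-scoring bank entry."""
    return max(range(len(scores)), key=lambda i: scores[i])


def refine(traj, offsets):
    """Fine step: add a per-waypoint (dx, dy) offset to the coarse choice."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(traj, offsets)]


class TrajectoryMemory:
    """Holds the latest slow-branch trajectory for the fast planner to read."""

    def __init__(self):
        self._latest = None

    def write(self, traj):
        self._latest = traj

    def read(self, default):
        # Fall back to a default until the slow branch has produced something.
        return self._latest if self._latest is not None else default


# Usage: the slow branch writes occasionally; the fast branch reads every frame.
bank = build_bank()
scores = [0.0] * len(bank)
scores[7] = 1.0  # placeholder for a model-produced score
coarse = bank[coarse_select(scores)]
refined = refine(coarse, [(0.1, 0.0)] * len(coarse))  # placeholder offsets

memory = TrajectoryMemory()
initial_guess = memory.read(default=bank[0])  # nothing written yet
memory.write(refined)
next_guess = memory.read(default=bank[0])  # now the refined trajectory
```

The design choice illustrated here is that the expensive reasoning step and the real-time planner never block each other: the planner always has some initialization available, and it simply becomes better once a refined trajectory has been written.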
