Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang
TL;DR
This work addresses the challenge of trajectory planning for autonomous driving by reframing planning as next waypoint prediction and leveraging a lean Vision-Language Model in a pure end-to-end architecture. Max-V1 bypasses BEV intermediates and uses continuous Gaussian-based regression to predict a sequence of 2D waypoints $\boldsymbol{w}_t=(x_t,y_t)$ from a single front-view image, guided by a physics-informed loss $\text{L}_{\text{distance}}=\sum_{i,t}\big\|\boldsymbol{w}_{i,t}-\boldsymbol{w}'_{i,t}\big\|$, enabling stable single-pass generation. The method demonstrates state-of-the-art performance on nuScenes and strong zero-shot cross-domain generalization to unseen vehicles and environments, with additional insights from ablations on supervision type and multi-sensor fusion. These results establish a lightweight yet powerful foundation for future self-driving agents, highlighting the potential of integrating pre-trained VLMs with principled, task-specific supervision and suggesting directions toward reinforcement learning and efficiency improvements. The work also discusses limitations, including inference latency and interpretability, and outlines a path toward more scalable, robust driving systems.
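To make the distance loss concrete, here is a minimal sketch of how a sum of waypoint distances over samples $i$ and timesteps $t$ could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function name `distance_loss` and the nested-list layout are hypothetical, the norm is taken to be the unsquared Euclidean norm (the summary does not say whether it is squared), and one argument is assumed to hold predicted waypoints and the other expert waypoints.

```python
import math
from typing import Sequence, Tuple

# A 2D waypoint (x_t, y_t).
Waypoint = Tuple[float, float]


def distance_loss(
    predicted: Sequence[Sequence[Waypoint]],
    target: Sequence[Sequence[Waypoint]],
) -> float:
    """Sum of Euclidean distances between paired waypoints.

    Assumption: predicted[i][t] and target[i][t] are the waypoints for
    sample i at timestep t. The norm is unsquared here; the source
    summary does not specify which variant is used.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of samples")

    total = 0.0
    for pred_traj, target_traj in zip(predicted, target):
        if len(pred_traj) != len(target_traj):
            raise ValueError("each trajectory pair must have the same number of waypoints")
        for (px, py), (tx, ty) in zip(pred_traj, target_traj):
            # Euclidean distance between the two waypoints at this timestep.
            total += math.hypot(px - tx, py - ty)
    return total


# One sample with two timesteps: the first waypoint matches exactly,
# the second is offset by (3, 4), i.e. a distance of 5.
example_pred = [[(0.0, 0.0), (3.0, 4.0)]]
example_target = [[(0.0, 0.0), (0.0, 0.0)]]
example_loss = distance_loss(example_pred, example_target)
```

In a training setup this quantity would normally be computed on tensors by an automatic-differentiation framework; the pure-Python version only shows the arithmetic.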
Abstract
In this work, we reconceptualize autonomous driving as a generalized language and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of a Vision-Language Model (VLM) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework well suited to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Given these empirical strengths, this work introduces a model capable of fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.
