AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models

Yuxuan Han; Kunyuan Wu; Qianyi Shao; Renxiang Xiao; Zilu Wang; Cansen Jiang; Yi Xiao; Liang Hu; Yunjiang Lou

TL;DR

AppleVLM introduces a planning-enhanced Vision-Language Model for end-to-end autonomous driving. It combines a deformable-transformer vision encoder with a BEV-based planning module and a CoT-tuned VLM decoder to produce robust driving waypoints, while freezing key encoders during end-to-end training. The approach demonstrates state-of-the-art performance on CARLA benchmarks and successful real-world deployment on a Scout AGV, showing improved resilience to sensor variations and better handling of corner cases. Overall, the work advances robust, interpretable end-to-end driving by fusing vision, language, and explicit planning information in a unified framework.

Abstract

End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. First, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Second, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View (BEV) spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned with a hierarchical Chain-of-Thought (CoT) integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.

Paper Structure (32 sections, 16 equations, 9 figures, 11 tables)

Introduction
Related Work
End-to-End Imitation Learning
VLMs in Driving
Methodology
Problem Setting
Architecture
Multi-modality Encoder
Information Decoder
Loss Function
Vision Encoder Pre-training with BEVs
Planning Strategy Encoder Training
VLM Fine-tuning with Corner Cases
End-to-end Training of AppleVLM
Environment
...and 17 more sections

Figures (9)

Figure 1: The proposed AppleVLM follows an encoder-decoder architecture. The Multi-modality Encoder includes three types of encoders: 1) a vision encoder that processes a time sequence of multi-view sensor data (RGB images and point clouds) and generates vision features; 2) a language encoder that encodes the navigation instructions into language tokens; 3) a planning strategy encoder that takes vision features as input and outputs the planning template tokens. The features from the three modalities are fused by a module based on the Q-Former architecture. The Information Decoder adopts a VLM backbone (such as LLaVA or Janus Pro) to process the multi-modal features. This VLM is pre-trained with corner-case data following a CoT mechanism with three tasks (general perception, region perception, and driving suggestion) to predict a sequence of driving waypoints. During the training process of end-to-end driving, the fine-tuned VLM is frozen, and an LQR controller is adopted to transform waypoints into control actions for actual driving, i.e., steering angle, throttle, and brake. The training process consists of four stages: stage 1 pre-trains the vision encoder for BEV prediction; stage 2 learns the planning strategy with features from the frozen vision encoder; stage 3 fine-tunes the VLM with the Q-Former on corner-case data with the CoT mechanism; and stage 4 is the end-to-end training, which leverages features from all frozen encoders together with the trainable Q-Former and VLM.
Figure 2: Details of the vision encoder. The features of images and the point cloud are fused by the self-attention mechanism at several convolution blocks in the ResNet64 backbone. Furthermore, a deformable self-attention mechanism is applied to the image feature sequence over $T$ frames, and a deformable cross-attention mechanism is adopted to associate features from the modalities of images and the point cloud.
Figure 3: An example of a driving scenario is shown on the left. By applying the Epsilon planning method [ding2021epsilon], the corridors that represent the possible driving space over time are generated on the right. Corridors of the other vehicle and the pedestrian are illustrated in gray and blue, respectively. The red rectangle indicates the initial location of the ego vehicle. By integrating the top $N$ policies (the orange and green lines), we construct a constraint corridor for the ego vehicle.
Figure 4: The planning template tokens $\mathbf{\mathcal{T}}_{P}$, encoded from an explicit representation (i.e., the spatial-temporal corridor) of the driving scene, are fused into the vision and language features by a Q-Former-based architecture.
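As a rough illustration of the fusion idea in Figure 4, the sketch below lets a small set of query vectors attend over the concatenated vision, language, and planning tokens. It is a minimal sketch under stated assumptions, not the paper's module: the function name `cross_attention_fuse`, the token counts, the feature width `d`, and the random inputs are all hypothetical, and it uses a single attention head without the learned projections, self-attention layers, or feed-forward blocks a real Q-Former contains.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(queries, vision, language, planning):
    """Single-head cross-attention: each query attends over all modality tokens.

    Hypothetical simplification: keys and values are the raw tokens
    (no learned projection matrices).
    """
    context = np.concatenate([vision, language, planning], axis=0)  # (N, d)
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (Q, N) scaled dot products
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights           # fused tokens: (Q, d)

# Illustrative shapes only; none of these sizes come from the paper.
rng = np.random.default_rng(0)
d = 16
vision = rng.normal(size=(10, d))    # stand-in vision features
language = rng.normal(size=(5, d))   # stand-in language tokens
planning = rng.normal(size=(4, d))   # stand-in planning template tokens
queries = rng.normal(size=(8, d))    # stand-in query vectors

fused, weights = cross_attention_fuse(queries, vision, language, planning)
```

The output has one fused vector per query regardless of how many tokens each modality contributes, which is the property that makes a query-based fusion module convenient in front of a decoder with a fixed input budget.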
Figure 5: The flow of VLM fine-tuning with CoT. The process formulates reasoning as a series of question-answer pairs in three tasks: General Perception, Region Perception, and Driving Suggestion. During fine-tuning, the output of each task is saved into a memory buffer and reused by the subsequent tasks. The fine-tuned VLM is frozen during the end-to-end training and inference of AppleVLM, as shown in the blue region.
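The memory-buffer flow in Figure 5 can be pictured as a loop in which each task's answer is stored and prepended to the prompt of the next task. The sketch below is only an illustration under stated assumptions: `run_cot`, `stub_vlm`, the prompt layout, and the example questions are hypothetical and are not taken from the paper; a real system would call the VLM with image features as well as text.

```python
from typing import Callable, Dict, List, Tuple

# Task order as named in the Figure 5 caption.
TASKS = ["General Perception", "Region Perception", "Driving Suggestion"]

def run_cot(answer_fn: Callable[[str], str],
            questions: Dict[str, str]) -> List[Tuple[str, str]]:
    """Run the three tasks in order, feeding earlier answers into later prompts."""
    memory: List[Tuple[str, str]] = []  # the memory buffer of (task, answer) pairs
    for task in TASKS:
        # Earlier outputs become context for the current question.
        context = "\n".join(f"[{t}] {a}" for t, a in memory)
        prompt = f"{context}\n[{task}] {questions[task]}".strip()
        memory.append((task, answer_fn(prompt)))
    return memory

def stub_vlm(prompt: str) -> str:
    # Hypothetical stand-in for a VLM: reports how many earlier answers it saw.
    return f"answer after {prompt.count('[') - 1} earlier task(s)"

# Hypothetical example questions, one per task.
QUESTIONS = {
    "General Perception": "What objects are in the scene?",
    "Region Perception": "What is in the highlighted region?",
    "Driving Suggestion": "What should the ego vehicle do?",
}

memory = run_cot(stub_vlm, QUESTIONS)
```

With the stub, the third answer reports two earlier tasks, confirming that the Driving Suggestion prompt carries both perception outputs.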
...and 4 more figures
