Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures

Yuechen Luo; Qimao Chen; Fang Li; Shaoqing Xu; Jaxin Liu; Ziying Song; Zhi-xin Yang; Fuxi Wen

TL;DR

This work proposes VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback to unlock the latent capabilities of VLA models. It achieves state-of-the-art (SOTA) performance on the public NAVSIM benchmark in overall PDMS, EPDMS, and high-level planning accuracy.

Abstract

Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises because exploration is constrained by the preceding Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause: incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark in overall PDMS, EPDMS, and high-level planning accuracy.
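
To illustrate why an all-zero rollout group carries no learning signal, and why injecting a single high-reward refinement changes that, here is a minimal sketch. It assumes the commonly used GRPO-style group normalization, (r - mean) / (std + eps), which may differ from the paper's exact objective. The function name `group_advantages` and all reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages within one rollout group.

    Assumed form: (r - mean) / (std + eps). This is a common GRPO-style
    normalization, not necessarily the paper's exact formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical long-tail scenario: every explored rollout scores zero.
initial_rewards = [0.0, 0.0, 0.0, 0.0]
adv_plain = group_advantages(initial_rewards)

# Inject one feedback-guided refinement with a hypothetical high reward.
refined_reward = 0.9
adv_augmented = group_advantages(initial_rewards + [refined_reward])

print(adv_plain)      # all zeros: no gradient signal from this group
print(adv_augmented)  # refined sample is positive, failed rollouts negative
```

In this toy setting the unaugmented group yields zero advantage for every rollout, whereas the augmented group assigns a positive advantage to the refinement and negative advantages to the failed rollouts. That is one way to read the abstract's "targeted gradient".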

Paper Structure (21 sections, 12 equations, 11 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Methods
VLA Inputs Formulation
Two-Stage SFT for Cognition and Refinement
RL with Failure Feedback
Experiment
Implementation details
Performance Comparison
Ablation Studies
Visualization of Refinement Process
Conclusion
Data Construction Details
Details of Pre-training Data
Details of SFT Dataset
...and 6 more sections

Figures (11)

Figure 1: Comparison between RL fine-tuning of a general VLA and ELF-VLA. Top: VLA training with an RL algorithm suffers from a performance plateau: in certain scenarios, the policy model's rollouts consistently yield low-scoring answers, trapping the agent and preventing it from discovering a better policy. Bottom: ELF-VLA addresses this by using a teacher model to provide structured feedback, which is then used to re-rollout a refinement, forcing the policy to break through this performance plateau.
Figure 2: Overview of ELF-VLA. First, the model is pre-trained on an autonomous driving Q&A dataset to provide it with foundational driving knowledge. Subsequently, it undergoes SFT on a mixed dataset of "Base Inputs" and "Feedback Inputs", enabling it to learn trajectory prediction and feedback-based refinement simultaneously. Finally, in the RL phase, a teacher model is used to generate feedback, thereby reducing the proportion of zero-reward rollouts.
Figure 3: Overview of GRPO with feedback. The policy model generates initial responses. Based on the rewards, the teacher model (Qwen3-VL-32B) provides feedback, guiding the policy to sample improved refinement responses. A high-quality refinement response is selected and combined with the initial response set for joint optimization. Policy Shaping is applied to the final probability.
Figure 4: Ratio of total-failure samples measured during the RL training phase for GRPO, GT-GRPO, Rule-GRPO, and ELF-VLA. A total failure indicates all rollouts for a sample failed on a specific metric (PDMS below $s$, NC of 0 and DAC of 0, respectively).
Figure 5: Visualization of the trajectory refinement process by ELF-VLA on the NAVSIM dataset, showing the initial Wrong Trajectories (red), the Ground Truth (green), and the final Refined Trajectory (blue). Teacher-generated Feedback guides the refinement of a Wrong Trajectory into a Refined Trajectory. Colored text in the Feedback details the specific refinements that have been applied.
...and 6 more figures
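
The total-failure ratio described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the helper `total_failure_ratio`, the example scores, and the value chosen for the threshold `s` are all hypothetical, since the caption leaves `s` unspecified.

```python
def total_failure_ratio(groups, failed):
    """Fraction of samples for which every rollout satisfies `failed`."""
    if not groups:
        return 0.0
    total_failures = sum(
        1 for rollouts in groups if all(failed(r) for r in rollouts)
    )
    return total_failures / len(groups)


# Hypothetical PDMS scores: 3 samples, 4 rollouts each.
pdms_groups = [
    [0.0, 0.0, 0.0, 0.0],  # every rollout fails
    [0.0, 0.8, 0.0, 0.9],  # some rollouts succeed
    [0.1, 0.0, 0.2, 0.0],  # every rollout is below the threshold
]
s = 0.5  # placeholder threshold; the caption does not give its value
pdms_ratio = total_failure_ratio(pdms_groups, lambda score: score < s)

# Hypothetical binary NC outcomes (0 = failed), same layout.
nc_groups = [[0, 0, 0, 0], [1, 1, 0, 1], [1, 1, 1, 1]]
nc_ratio = total_failure_ratio(nc_groups, lambda v: v == 0)

print(pdms_ratio, nc_ratio)
```

The same helper applies to DAC by passing the corresponding binary outcomes with the `v == 0` predicate.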
