Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

Canyu Chen; Yuguang Yang; Zhewen Tan; Yizhi Wang; Ruiyi Zhan; Haiyan Liu; Xuanyao Mao; Jason Bao; Xinyue Tang; Linlin Yang; Bingchuan Sun; Yan Wang; Baochang Zhang

TL;DR

Curious-VLA is a two-stage framework that alleviates the exploit-explore dilemma in driving VLA models. In the imitation learning stage, it introduces a Feasible Trajectory Expansion (FTE) strategy that generates multiple physically valid trajectories, together with a step-wise normalized trajectory representation that lets the model adapt to this diverse data.

Abstract

We identify a fundamental Narrow Policy limitation that undermines the performance of autonomous driving VLA models: Imitation Learning (IL) tends to collapse exploration, which limits the potential of the subsequent Reinforcement Learning (RL) stage and causes it to saturate prematurely due to insufficient feedback diversity. We therefore propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories, and a step-wise normalized trajectory representation to adapt to this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS), which prioritizes high-diversity samples, and introduce the Spanning Driving Reward (SDR), which uses focal-style weighting to widen the reward's value span and so improve sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.
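The abstract names a step-wise normalized trajectory representation without giving its form on this page. As a rough illustration only, the sketch below assumes it means standardizing each prediction step with that step's own mean and standard deviation, so that near and far waypoints share a comparable numeric scale. The function names (`fit_step_stats`, `normalize`, `denormalize`) and the choice of mean/std statistics are hypothetical and are not taken from the paper or its code.

```python
import statistics


def fit_step_stats(trajectories):
    """Compute per-step, per-axis (mean, std) over a set of trajectories.

    Each trajectory is a list of T (x, y) waypoints; all share the same T.
    Assumption: plain mean/std standardization, not the paper's exact scheme.
    """
    num_steps = len(trajectories[0])
    stats = []
    for t in range(num_steps):
        per_axis = []
        for axis in range(2):
            values = [traj[t][axis] for traj in trajectories]
            mean = statistics.fmean(values)
            # Fall back to 1.0 when a step has zero spread to avoid dividing by zero.
            std = statistics.pstdev(values) or 1.0
            per_axis.append((mean, std))
        stats.append(tuple(per_axis))
    return stats


def normalize(traj, stats):
    """Map each waypoint to step-wise standardized coordinates."""
    return [
        tuple((point[a] - stats[t][a][0]) / stats[t][a][1] for a in range(2))
        for t, point in enumerate(traj)
    ]


def denormalize(traj, stats):
    """Invert normalize(), recovering metric coordinates."""
    return [
        tuple(point[a] * stats[t][a][1] + stats[t][a][0] for a in range(2))
        for t, point in enumerate(traj)
    ]


if __name__ == "__main__":
    # Two toy trajectories whose second step is ten times farther out than the first.
    data = [[(1.0, 0.0), (10.0, 0.0)], [(3.0, 0.0), (30.0, 0.0)]]
    step_stats = fit_step_stats(data)
    print(normalize(data[0], step_stats))
```

Under this assumption, both steps of the toy trajectories land on the same normalized scale even though their metric distances differ by a factor of ten, which is the kind of horizon-dependent scale gap the Figure 3 caption refers to.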

Paper Structure (26 sections, 10 equations, 8 figures, 10 tables, 1 algorithm)


Introduction
Related Work
Preliminary and Narrow Policy
Preliminary: VLA Training Pipeline
Analysis of the Narrow Policy (NP)
Curious-VLA
Feasible Trajectory Expansion in Imitation Learning
Diversity-Aware Reinforcement Learning
Experiment
Datasets and Evaluation Metrics
Implement Details
Main Results
Analytical Experiments of Exploration
Ablation Study
Conclusion
...and 11 more sections

Figures (8)

Figure 1: Visualization of Narrow Policy. (a) shows a quantitative comparison of exploration. The baselines achieve low diversity (mean-pFDE: 0.20m for Qwen2.5-VL, 0.33m for ReCogDrive) and low quality (min-FDE: 1.05m for Qwen2.5-VL). In contrast, our Curious-VLA achieves the best diversity, quality and performance. (b) compares the trajectories of Qwen2.5-VL and our Curious-VLA. Qwen2.5-VL trajectories exhibit exploration collapse, with a single mode or unsafe behaviors, while ours do not.
Figure 2: Overall Pipeline of Curious-VLA. As identified in Sec. \ref{subsec:derivation}, existing VLAs with the IL-RL pipeline suffer from "Narrow Policy" and tend to generate overlapping and unsafe behaviors. To address this, we introduce Curious-VLA, which takes into account both the thinking process and the planning waypoints, to improve diversity, quality and performance (middle panel). Specifically, Curious-VLA alleviates the Narrow Policy in both the IL stage (left panel) and the RL stage (right panel). For IL, to improve the separability of trajectory patterns, we generate diverse trajectories, structure the driving reasoning process into a four-stage CoT, and normalize each prediction step (see Sec. \ref{subsec:fte}). For RL, to address the advantage collapse problem and sustain exploration, we introduce Adaptive Diversity-Aware Sampling and the Spanning Driving Reward (see Sec. \ref{subsec:darl}).
Figure 3: Visualization of the physical scale mismatch across prediction horizons, shown by the waypoint distribution.
Figure 4: Qualitative comparison in BEV and camera views. Our Curious-VLA produces more feasible trajectories.
Figure 5: The prompt template and representative answer examples used to filter semantic-aware challenging scenarios with multiple feasible driving intents.
...and 3 more figures
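The Figure 1 caption reports mean-pFDE as a diversity measure and min-FDE as a quality measure, but their definitions are not given on this page. The sketch below assumes the common reading: mean-pFDE is the mean pairwise final displacement error among the trajectories sampled for one scene, and min-FDE is the smallest final displacement error between any sample and a reference trajectory. The function names and this interpretation are assumptions, not the paper's stated definitions.

```python
import math
from itertools import combinations


def final_displacement(traj_a, traj_b):
    """Euclidean distance between the last waypoints of two trajectories."""
    (ax, ay), (bx, by) = traj_a[-1], traj_b[-1]
    return math.hypot(ax - bx, ay - by)


def mean_pairwise_fde(samples):
    """Average final displacement over all unordered pairs of samples.

    Assumed reading of "mean-pFDE": larger values mean the sampled
    endpoints are more spread out, i.e. the policy explores more.
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(final_displacement(a, b) for a, b in pairs) / len(pairs)


def min_fde(samples, reference):
    """Smallest final displacement between any sample and the reference.

    Assumed reading of "min-FDE": smaller values mean at least one
    sample ends close to the reference endpoint.
    """
    return min(final_displacement(s, reference) for s in samples)


if __name__ == "__main__":
    # Three toy samples of two waypoints each, in meters.
    samples = [
        [(0.0, 0.0), (1.0, 0.0)],
        [(0.0, 0.0), (1.0, 3.0)],
        [(0.0, 0.0), (5.0, 0.0)],
    ]
    reference = [(0.0, 0.0), (1.0, 1.0)]
    print(mean_pairwise_fde(samples), min_fde(samples, reference))
```

Read this way, a policy whose samples nearly coincide has a mean pairwise FDE close to zero, which matches the caption's description of low-diversity baselines.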
