SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios

Tian Gao; Celine Tan; Catherine Glossop; Timothy Gao; Jiankai Sun; Kyle Stachowicz; Shirley Wu; Oier Mees; Dorsa Sadigh; Sergey Levine; Chelsea Finn

TL;DR

SteerVLA addresses long-tail driving challenges with a hierarchical VLM-VLA policy: a vision-language model performs high-level semantic reasoning and issues language instructions that steer a low-level driving policy. It introduces an automatic language-labeling pipeline that produces dense meta-action supervision and reasoning traces, tightly coupling planning and control. On the closed-loop Bench2Drive benchmark in CARLA, SteerVLA outperforms prior state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The work highlights the value of separating semantic reasoning from fine-grained control, and notes limitations such as single-camera input, with multi-view perception and real-world deployment left as future work.

Abstract

A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge of VLMs to guide a steerable driving policy toward robust control in driving scenarios. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine-grained language instructions that steer a vision-language-action (VLA) driving policy. Key to our method is this rich language interface between the high-level VLM and low-level VLA, which allows the high-level policy to more effectively ground its reasoning in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with detailed language annotations, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The project website is available at: https://steervla.github.io/.
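The hierarchical interface described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Guidance`, `stub_high_level`, `stub_low_level`, and `drive_step` are hypothetical, the scene is represented as a text string rather than a camera image, and both policies are hand-written stubs standing in for the VLM and the VLA. It shows only the data flow: the high-level policy emits a reasoning trace and a meta-action, and the low-level policy conditions its waypoints on that meta-action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]  # assumed (x, y) in the ego frame, in metres


@dataclass
class Guidance:
    reasoning: str    # free-form reasoning trace from the high-level policy
    meta_action: str  # fine-grained language instruction, e.g. "slow down"


def stub_high_level(scene: str) -> Guidance:
    """Hypothetical stand-in for the VLM: maps a scene description to guidance."""
    if "blocked" in scene:
        return Guidance("The lane ahead is blocked.", "slow down")
    return Guidance("The road ahead is clear.", "keep speed")


def stub_low_level(scene: str, guidance: Guidance, horizon: int = 4) -> List[Waypoint]:
    """Hypothetical stand-in for the VLA: waypoint spacing follows the meta-action."""
    step = 0.5 if guidance.meta_action == "slow down" else 2.0
    return [(step * (i + 1), 0.0) for i in range(horizon)]


def drive_step(
    scene: str,
    high_level: Callable[[str], Guidance] = stub_high_level,
    low_level: Callable[[str, Guidance], List[Waypoint]] = stub_low_level,
) -> Tuple[Guidance, List[Waypoint]]:
    """One control step: reason in language first, then act on the instruction."""
    guidance = high_level(scene)
    return guidance, low_level(scene, guidance)


guidance, waypoints = drive_step("lane blocked by a traffic accident")
print(guidance.meta_action, waypoints)
```

Keeping the two policies behind plain callables mirrors the separation the abstract argues for: the language instruction is the only channel through which high-level reasoning reaches the controller, so either side can be swapped without changing the other.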

Paper Structure (33 sections, 8 figures, 5 tables)

Introduction
Related Work
Preliminaries
SteerVLA
Steering VLA Control using VLM Reasoning
Generating language labels for SteerVLA
Implementation Details
Experimental Results
Experimental Setup
Training dataset.
Policy deployment.
Evaluation metrics.
Baselines.
Evaluating SteerVLA on Driving Performance
Case Study in Long-Tail Scenarios
...and 18 more sections

Figures (8)

Figure 1: SteerVLA encountering long-tail scenarios. SteerVLA is able to quickly reason about and adapt to a traffic accident blocking the lane. It first slows down, then waits for a gap in the traffic, and merges when the lane is available.
Figure 2: Framework overview. Our model reasons over the driving context to produce reasoning traces and fine-grained meta-actions that guide driving control. A VLM generates structured semantic guidance, which steers a VLA policy in predicting future waypoints. To supervise reasoning and meta-action generation, we introduce an automatic data labeling pipeline that derives fine-grained language supervision from driving trajectories, improving alignment between language and control.
Figure 3: Closed-loop evaluation of SteerVLA on Bench2Drive. On Bench2Drive, we report overall performance and per-ability scores for SteerVLA across five advanced urban driving skills. SteerVLA significantly outperforms prior approaches, benefiting from improved reasoning and instruction-following capabilities.
Figure 4: Closed-loop evaluation of SteerVLA on Bench2Drive-LongTail. We compare SteerVLA with the state-of-the-art method SimLingo on Bench2Drive-LongTail. SteerVLA exhibits larger performance gains in long-tail scenarios, likely because these cases require more complex reasoning and more precise control.
Figure 5: Long-tail scenario case study. We analyze how SteerVLA reasons and acts in long-tail driving scenarios. In (a), we compare SteerVLA with an off-the-shelf VLM, showing that SteerVLA produces both descriptive reasoning and actionable meta-actions. In (b), we compare SteerVLA with SimLingo in a lane-change interaction where an adjacent vehicle does not give way, highlighting SteerVLA’s ability to make timely high-level decisions. In (c), we evaluate instruction-following behavior when the lane ahead is blocked, where SteerVLA executes the deceleration meta-action immediately while SimLingo exhibits delayed control.
...and 3 more figures
