SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios

Tian Gao, Celine Tan, Catherine Glossop, Timothy Gao, Jiankai Sun, Kyle Stachowicz, Shirley Wu, Oier Mees, Dorsa Sadigh, Sergey Levine, Chelsea Finn

TL;DR

SteerVLA addresses long-tail driving challenges with a hierarchical VLM-VLA policy that grounds the high-level semantic reasoning of vision-language models in a steerable driving policy. It introduces an automatic language-labeling pipeline that produces dense meta-action supervision and reasoning traces, tightly coupling planning and control. On CARLA benchmarks, SteerVLA achieves state-of-the-art driving scores, including an 8.04-point gain on long-tail scenarios, which suggests that grounded reasoning generalizes well. The work highlights the value of separating semantic reasoning from fine-grained control, and notes limitations such as single-camera input, with multi-view perception and real-world deployment left as future directions.

Abstract

A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge of VLMs to guide a steerable driving policy toward robust control in driving scenarios. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine-grained language instructions that steer a vision-language-action (VLA) driving policy. Key to our method is this rich language interface between the high-level VLM and low-level VLA, which allows the high-level policy to more effectively ground its reasoning in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with detailed language annotations, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The project website is available at: https://steervla.github.io/.
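
The two-level interface described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: both models are replaced by hypothetical rule-based stand-ins, the scene is a text string rather than camera input, and every name here (`Guidance`, `high_level_vlm`, `low_level_vla`, `drive_step`), the meta-action strings, and the numeric constants are assumptions chosen only to show how a language instruction could steer waypoint prediction.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x forward, y lateral) in metres, ego frame. Assumed convention for this sketch.
Waypoint = Tuple[float, float]


@dataclass
class Guidance:
    reasoning: str    # free-text reasoning trace
    meta_action: str  # fine-grained language instruction for the low-level policy


def high_level_vlm(scene_description: str) -> Guidance:
    """Hypothetical stand-in for the high-level VLM (keyword rules, not a model)."""
    if "blocked" in scene_description:
        return Guidance("The lane ahead is blocked.", "decelerate")
    return Guidance("The lane ahead is clear.", "keep speed")


def low_level_vla(meta_action: str, speed: float,
                  horizon: int = 4, dt: float = 0.5) -> List[Waypoint]:
    """Hypothetical stand-in for the VLA: a constant-acceleration roll-out."""
    accel = -2.0 if meta_action == "decelerate" else 0.0  # m/s^2, assumed value
    waypoints: List[Waypoint] = []
    x, v = 0.0, speed
    for _ in range(horizon):
        v = max(0.0, v + accel * dt)  # never drive backwards
        x += v * dt
        waypoints.append((x, 0.0))
    return waypoints


def drive_step(scene_description: str, speed: float):
    """One control step: reason in language first, then predict waypoints."""
    guidance = high_level_vlm(scene_description)
    return guidance, low_level_vla(guidance.meta_action, speed)
```

Calling `drive_step("lane blocked by accident", 5.0)` yields the `decelerate` meta-action and waypoints that cover less distance than those for a clear lane. The point of the sketch is the design choice: the only channel between the two levels is the language instruction.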

Paper Structure (33 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: SteerVLA encountering long-tail scenarios. SteerVLA is able to quickly reason about and adapt to a traffic accident blocking the lane. It first slows down, then waits for a gap in the traffic, and merges when the lane is available.
  • Figure 2: Framework overview. Our model reasons over the driving context to produce reasoning traces and fine-grained meta-actions that guide driving control. A VLM generates structured semantic guidance, which steers a VLA policy in predicting future waypoints. To supervise reasoning and meta-action generation, we introduce an automatic data labeling pipeline that derives fine-grained language supervision from driving trajectories, improving alignment between language and control.
  • Figure 3: Closed-loop evaluation of SteerVLA on Bench2Drive. On Bench2Drive, we report overall performance and per-ability scores for SteerVLA across five advanced urban driving skills. SteerVLA significantly outperforms prior approaches, benefiting from improved reasoning and instruction-following capabilities.
  • Figure 4: Closed-loop evaluation of SteerVLA on Bench2Drive-LongTail. We compare SteerVLA with the state-of-the-art method SimLingo on Bench2Drive-LongTail. SteerVLA exhibits larger performance gains in long-tail scenarios, likely because these cases require more complex reasoning and more precise control.
  • Figure 5: Long-tail scenario case study. We analyze how SteerVLA reasons and acts in long-tail driving scenarios. In (a), we compare SteerVLA with an off-the-shelf VLM, showing that SteerVLA produces both descriptive reasoning and actionable meta-actions. In (b), we compare SteerVLA with SimLingo in a lane-change interaction where an adjacent vehicle does not give way, highlighting SteerVLA’s ability to make timely high-level decisions. In (c), we evaluate instruction-following behavior when the lane ahead is blocked, where SteerVLA executes the deceleration meta-action immediately while SimLingo exhibits delayed control.
  • ...and 3 more figures