Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models

Angel Martinez-Sanchez; Parthib Roy; Ross Greer

TL;DR

The paper addresses how passenger natural-language instructions can steer real-world autonomous driving motion planning by grounding free-form directives in a vision-language-action model. It adapts OpenEMMA to ingest doScenes prompts, generating 10-step speed-curvature trajectories and evaluating them with average displacement error (ADE) on 849 nuScenes scenes. Instruction conditioning substantially improves robustness: mean ADE drops by 98.7% when outlier scenes are included, because instructions prevent extreme baseline failures, and well-phrased prompts still improve ADE by up to 5.1% once those outliers are removed. The results also show that prompt quality (length and referential grounding) influences trajectory alignment. The work contributes a reproducible instruction-conditioned baseline, releases prompts and scripts, and positions doScenes as a benchmark for instruction-aware planning in autonomous driving.
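To make the "10-step speed-curvature trajectory" representation concrete, the sketch below unrolls a sequence of (speed, curvature) pairs into planar waypoints with a simple forward-Euler kinematic update. It is an illustration only, not OpenEMMA's actual integration code: the function name `unroll_speed_curvature`, the 0.5 s step (`dt`), and the ego-centered starting pose are all assumptions introduced here.

```python
import math


def unroll_speed_curvature(speeds, curvatures, dt=0.5,
                           x0=0.0, y0=0.0, heading0=0.0):
    """Convert per-step speeds (m/s) and curvatures (1/m) into (x, y) waypoints.

    Hypothetical helper: dt=0.5 s and the Euler update are assumptions,
    not taken from the paper or from the OpenEMMA codebase.
    """
    if len(speeds) != len(curvatures):
        raise ValueError("speeds and curvatures must have the same length")

    x, y, heading = x0, y0, heading0
    waypoints = []
    for v, k in zip(speeds, curvatures):
        # Yaw rate is speed times curvature; advance the heading first.
        heading += v * k * dt
        # Then move along the updated heading for one time step.
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        waypoints.append((x, y))
    return waypoints


# Ten steps at 2 m/s with zero curvature: a straight line along x.
straight = unroll_speed_curvature([2.0] * 10, [0.0] * 10)

# Same speed with a constant positive curvature: the path bends left (y > 0).
left_turn = unroll_speed_curvature([2.0] * 10, [0.1] * 10)
```

Under this convention, positive curvature turns the vehicle to the left; a planner that uses the opposite sign convention would need the curvature negated before unrolling.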

Abstract

Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting. We present a reproducible instruction-conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA's vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using average displacement error (ADE), we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: https://github.com/Mi3-Lab/doScenes-VLM-Planning
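As a rough illustration of the evaluation described above, the following sketch computes ADE as the mean Euclidean distance between predicted and ground-truth waypoints, and the relative reduction between a baseline and an instruction-conditioned score. The helper names `ade` and `percent_reduction` and the toy trajectories are hypothetical; they are not taken from the released scripts and the numbers are not results from the paper.

```python
import math


def ade(predicted, ground_truth):
    """Average displacement error: mean Euclidean distance over matched waypoints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


def percent_reduction(baseline_error, conditioned_error):
    """Relative improvement of the conditioned error over the baseline, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - conditioned_error) / baseline_error


# Toy two-step example (illustrative values only).
gt = [(0.0, 0.0), (0.0, 0.0)]
pred = [(0.0, 0.0), (3.0, 4.0)]   # second waypoint is 5 m from ground truth
toy_ade = ade(pred, gt)            # (0 + 5) / 2 = 2.5

toy_gain = percent_reduction(10.0, 9.0)  # 10.0 percent
```

Because the mean is taken over scenes as well as waypoints, a small number of scenes with very large baseline errors can dominate the aggregate, which is why the abstract reports results both with and without outliers.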

Paper Structure (17 sections, 1 equation, 3 figures, 3 tables)

Introduction
Related Research
Autonomous Driving Datasets and Natural Language
Language-Conditioned Planning Models
Methodology
Data Source
Instruction Integration into OpenEMMA
Application: Instruction-Conditioned Trajectory Generation
Evaluation Metric
Scene Selection
Multimodal Large Language Model Selection
Hardware
Results & Discussion
Outlier Scenes
Prompt Phrasing
...and 2 more sections

Figures (3)

Figure 1: Dynamic or suddenly-unstructured driving scenes create situations which call upon a human-in-the-loop to offer an instruction to the autonomous vehicle for appropriate scene reasoning and planning. This is especially relevant in ambiguous scenes, such as the scene depicted above, where the vehicle could reasonably "turn left into the parking lot" or "proceed straight but left of the cones" depending on what the passenger determines to be safe and goal-aligned. We present a method for integrating natural language instructions into a VLM-driven motion planner, and demonstrate the effectiveness of instruction-informed trajectories in emulating ground truth.
Figure 2: Example of the OpenEMMA baseline model making predictions of points which fall outside the scene boundary.
Figure 3: A qualitative comparison of OpenEMMA predictions with and without natural language instruction. Each column shows one scene where the first row is without instruction, and the second row is with its corresponding doScenes prompt. In these cases, the doScenes prompt leads to a much safer plan for the ego vehicle, avoiding passing through an active-use crosswalk or making an unsafe left turn where going straight was expected.
